Commit | Line | Data |
---|---|---|
368ed582 SF |
1 | .. |
2 | Licensed under the Apache License, Version 2.0 (the "License"); you may | |
3 | not use this file except in compliance with the License. You may obtain | |
4 | a copy of the License at | |
5 | ||
6 | http://www.apache.org/licenses/LICENSE-2.0 | |
7 | ||
8 | Unless required by applicable law or agreed to in writing, software | |
9 | distributed under the License is distributed on an "AS IS" BASIS, WITHOUT | |
10 | WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the | |
11 | License for the specific language governing permissions and limitations | |
12 | under the License. | |
13 | ||
14 | Convention for heading levels in Open vSwitch documentation: | |
15 | ||
16 | ======= Heading 0 (reserved for the title in a document) | |
17 | ------- Heading 1 | |
18 | ~~~~~~~ Heading 2 | |
19 | +++++++ Heading 3 | |
20 | ''''''' Heading 4 | |
21 | ||
22 | Avoid deeper levels because they do not render well. | |
23 | ||
24 | ================================ | |
25 | Design Decisions In Open vSwitch | |
26 | ================================ | |
27 | ||
28 | This document describes design decisions that went into implementing Open | |
29 | vSwitch. While we believe these to be reasonable decisions, it is impossible | |
30 | to predict how Open vSwitch will be used in all environments. Understanding | |
31 | assumptions made by Open vSwitch is critical to a successful deployment. The | |
32 | end of this document contains contact information that can be used to let us | |
33 | know how we can make Open vSwitch more generally useful. | |
34 | ||
35 | Asynchronous Messages | |
36 | --------------------- | |
37 | ||
38 | Over time, Open vSwitch has added many knobs that control whether a given | |
39 | controller receives OpenFlow asynchronous messages. This section describes how | |
40 | all of these features interact. | |
41 | ||
42 | First, a service controller never receives any asynchronous messages unless it | |
43 | changes its miss_send_len from the service controller default of zero in one of | |
44 | the following ways: | |
45 | ||
46 | - Sending an ``OFPT_SET_CONFIG`` message with nonzero ``miss_send_len``. | |
47 | ||
48 | - Sending any ``NXT_SET_ASYNC_CONFIG`` message: as a side effect, this message | |
49 | changes the ``miss_send_len`` to ``OFP_DEFAULT_MISS_SEND_LEN`` (128) for | |
50 | service controllers. | |
51 | ||
52 | Second, ``OFPT_FLOW_REMOVED`` and ``NXT_FLOW_REMOVED`` messages are generated | |
53 | only if the flow that was removed had the ``OFPFF_SEND_FLOW_REM`` flag set. | |
54 | ||
55 | Third, ``OFPT_PACKET_IN`` and ``NXT_PACKET_IN`` messages are sent only to | |
56 | OpenFlow controller connections that have the correct connection ID (see | |
57 | ``struct nx_controller_id`` and ``struct nx_action_controller``): | |
58 | ||
59 | - For packet-in messages generated by a ``NXAST_CONTROLLER`` action, the | |
60 | controller ID specified in the action. | |
61 | ||
62 | - For other packet-in messages, controller ID zero. (This is the default ID | |
63 | when an OpenFlow controller does not configure one.) | |
64 | ||
65 | Finally, Open vSwitch consults a per-connection table indexed by the message | |
66 | type, reason code, and current role. The following table shows how this table | |
67 | is initialized by default when an OpenFlow connection is made. An entry | |
68 | labeled ``yes`` means that the message is sent; an entry labeled ``---`` means | |
69 | that the message is suppressed. | |
70 | ||
71 | .. table:: ``OFPT_PACKET_IN`` / ``NXT_PACKET_IN`` | |
72 | ||
73 | =========================================== ======= ===== | |
74 | master/ | |
75 | message and reason code other slave | |
76 | =========================================== ======= ===== | |
77 | ``OFPR_NO_MATCH`` yes --- | |
78 | ``OFPR_ACTION`` yes --- | |
79 | ``OFPR_INVALID_TTL`` --- --- | |
80 | ``OFPR_ACTION_SET`` (OF1.4+) yes --- | |
81 | ``OFPR_GROUP`` (OF1.4+) yes --- | |
82 | =========================================== ======= ===== | |
83 | ||
84 | .. table:: ``OFPT_FLOW_REMOVED`` / ``NXT_FLOW_REMOVED`` | |
85 | ||
86 | =========================================== ======= ===== | |
87 | master/ | |
88 | message and reason code other slave | |
89 | =========================================== ======= ===== | |
90 | ``OFPRR_IDLE_TIMEOUT`` yes --- | |
91 | ``OFPRR_HARD_TIMEOUT`` yes --- | |
92 | ``OFPRR_DELETE`` yes --- | |
93 | ``OFPRR_GROUP_DELETE`` (OF1.4+) yes --- | |
94 | ``OFPRR_METER_DELETE`` (OF1.4+) yes --- | |
95 | ``OFPRR_EVICTION`` (OF1.4+) yes --- | |
96 | =========================================== ======= ===== | |
97 | ||
98 | .. table:: ``OFPT_PORT_STATUS`` | |
99 | ||
100 | =========================================== ======= ===== | |
101 | master/ | |
102 | message and reason code other slave | |
103 | =========================================== ======= ===== | |
104 | ``OFPPR_ADD`` yes yes | |
105 | ``OFPPR_DELETE`` yes yes | |
106 | ``OFPPR_MODIFY`` yes yes | |
107 | =========================================== ======= ===== | |
108 | ||
109 | .. table:: ``OFPT_ROLE_REQUEST`` / ``OFPT_ROLE_REPLY`` (OF1.4+) | |
110 | ||
111 | =========================================== ======= ===== | |
112 | master/ | |
113 | message and reason code other slave | |
114 | =========================================== ======= ===== | |
115 | ``OFPCRR_MASTER_REQUEST`` --- --- | |
116 | ``OFPCRR_CONFIG`` --- --- | |
117 | ``OFPCRR_EXPERIMENTER`` --- --- | |
118 | =========================================== ======= ===== | |
119 | ||
120 | .. table:: ``OFPT_TABLE_STATUS`` (OF1.4+) | |
121 | ||
122 | =========================================== ======= ===== | |
123 | master/ | |
124 | message and reason code other slave | |
125 | =========================================== ======= ===== | |
126 | ``OFPTR_VACANCY_DOWN`` --- --- | |
127 | ``OFPTR_VACANCY_UP`` --- --- | |
128 | =========================================== ======= ===== | |
129 | ||
130 | ||
131 | .. table:: ``OFPT_REQUESTFORWARD`` (OF1.4+) | |
132 | ||
133 | =========================================== ======= ===== | |
134 | master/ | |
135 | message and reason code other slave | |
136 | =========================================== ======= ===== | |
137 | ``OFPRFR_GROUP_MOD`` --- --- | |
138 | ``OFPRFR_METER_MOD`` --- --- | |
139 | =========================================== ======= ===== | |
140 | ||
141 | The ``NXT_SET_ASYNC_CONFIG`` message directly sets all of the values in this | |
142 | table for the current connection. The ``OFPC_INVALID_TTL_TO_CONTROLLER`` bit | |
143 | in the ``OFPT_SET_CONFIG`` message controls the setting for | |
144 | ``OFPR_INVALID_TTL`` for the "master" role. | |
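
The per-connection table described above can be pictured as one bitmask of reason codes per message type and role. The following is only a minimal sketch under that assumption: the type, enum, and function names are hypothetical, the reason codes are simply numbered in the order the tables list them, and this is not the data structure Open vSwitch itself uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; illustration only. */
enum async_msg_type {
    AM_PACKET_IN, AM_FLOW_REMOVED, AM_PORT_STATUS,
    AM_ROLE_STATUS, AM_TABLE_STATUS, AM_REQUESTFORWARD,
    AM_N_TYPES
};

enum conn_role { ROLE_OTHER, ROLE_MASTER, ROLE_SLAVE };

/* Reason codes, numbered in the order the tables above list them. */
enum { R_NO_MATCH, R_ACTION, R_INVALID_TTL, R_ACTION_SET, R_GROUP };
enum { R_PORT_ADD, R_PORT_DELETE, R_PORT_MODIFY };

struct async_cfg {
    uint32_t master_or_other[AM_N_TYPES]; /* Bit N set: reason N is sent. */
    uint32_t slave[AM_N_TYPES];
};

/* Fills in the defaults shown in the tables above. */
void async_cfg_init_default(struct async_cfg *cfg)
{
    for (int t = 0; t < AM_N_TYPES; t++) {
        cfg->master_or_other[t] = 0;
        cfg->slave[t] = 0;
    }
    /* Packet-in: every listed reason except OFPR_INVALID_TTL. */
    cfg->master_or_other[AM_PACKET_IN] = (1u << R_NO_MATCH) | (1u << R_ACTION)
                                         | (1u << R_ACTION_SET) | (1u << R_GROUP);
    /* Flow-removed: all six listed reasons. */
    cfg->master_or_other[AM_FLOW_REMOVED] = 0x3f;
    /* Port status: all three reasons, for every role. */
    cfg->master_or_other[AM_PORT_STATUS] = 0x7;
    cfg->slave[AM_PORT_STATUS] = 0x7;
}

/* Returns true if a message of 'type' with 'reason' is sent to a connection
 * that currently has 'role'. */
bool async_msg_enabled(const struct async_cfg *cfg, enum async_msg_type type,
                       unsigned int reason, enum conn_role role)
{
    assert(type < AM_N_TYPES);
    const uint32_t *masks = role == ROLE_SLAVE ? cfg->slave
                                               : cfg->master_or_other;
    return reason < 32 && ((masks[type] >> reason) & 1u) != 0;
}
```

In this model, an ``NXT_SET_ASYNC_CONFIG`` message would overwrite the masks directly, and the ``OFPC_INVALID_TTL_TO_CONTROLLER`` bit would toggle only the ``R_INVALID_TTL`` bit of the packet-in mask for the "master" role.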
145 | ||
146 | ``OFPAT_ENQUEUE`` | |
147 | ----------------- | |
148 | ||
149 | The OpenFlow 1.0 specification requires the output port of the | |
150 | ``OFPAT_ENQUEUE`` action to "refer to a valid physical port (i.e. < | |
151 | ``OFPP_MAX``) or ``OFPP_IN_PORT``". Although ``OFPP_LOCAL`` is not less than | |
152 | ``OFPP_MAX``, it is an 'internal' port which can have QoS applied to it in | |
153 | Linux. Since we allow the ``OFPAT_ENQUEUE`` to apply to 'internal' ports whose | |
154 | port numbers are less than ``OFPP_MAX``, we interpret ``OFPP_LOCAL`` as a | |
155 | physical port and support ``OFPAT_ENQUEUE`` on it as well. | |
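
The resulting acceptance rule can be written as a one-line predicate. This is a sketch, not the Open vSwitch source: the function name is hypothetical, and the port constants are the OpenFlow 1.0 reserved port numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* OpenFlow 1.0 reserved port numbers. */
#define OFPP_MAX     0xff00
#define OFPP_IN_PORT 0xfff8
#define OFPP_LOCAL   0xfffe

/* OFPP_LOCAL is not below OFPP_MAX, so it needs its own clause. */
static_assert(OFPP_LOCAL >= OFPP_MAX, "OFPP_LOCAL is a reserved port number");

/* Hypothetical helper: returns true if 'port' is acceptable as the output
 * port of an OFPAT_ENQUEUE action under the interpretation described above. */
bool enqueue_port_is_acceptable(uint16_t port)
{
    return port < OFPP_MAX || port == OFPP_IN_PORT || port == OFPP_LOCAL;
}
```

Other reserved ports, such as the OpenFlow 1.0 flood port (``0xfffb``), are still rejected by this predicate.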
156 | ||
157 | ``OFPT_FLOW_MOD`` | |
158 | ----------------- | |
159 | ||
160 | The OpenFlow specification for the behavior of ``OFPT_FLOW_MOD`` is confusing. | |
161 | The following tables summarize the Open vSwitch implementation of its behavior | |
162 | in the following categories: | |
163 | ||
164 | "match on priority" | |
165 | Whether the ``flow_mod`` acts only on flows whose priority matches that | |
166 | included in the ``flow_mod`` message. | |
167 | ||
168 | "match on out_port" | |
169 | Whether the ``flow_mod`` acts only on flows that output to the out_port | |
170 | included in the flow_mod message (if out_port is not ``OFPP_NONE``). | |
171 | OpenFlow 1.1 and later have a similar feature (not listed separately here) | |
172 | for ``out_group``. | |
173 | ||
174 | "match on flow_cookie": | |
175 | Whether the ``flow_mod`` acts only on flows whose ``flow_cookie`` matches an | |
176 | optional controller-specified value and mask. | |
177 | ||
178 | "updates flow_cookie": | |
179 | Whether the ``flow_mod`` changes the ``flow_cookie`` of the flow or flows | |
180 | that it matches to the ``flow_cookie`` included in the flow_mod message. | |
181 | ||
182 | "updates ``OFPFF_SEND_FLOW_REM``": | |
183 | Whether the flow_mod changes the ``OFPFF_SEND_FLOW_REM`` flag of the flow or | |
184 | flows that it matches to the setting included in the flags of the flow_mod | |
185 | message. | |
186 | ||
187 | "honors ``OFPFF_CHECK_OVERLAP``": | |
188 | Whether the ``OFPFF_CHECK_OVERLAP`` flag in the flow_mod is significant. | |
189 | ||
190 | "updates ``idle_timeout``" and "updates ``hard_timeout``": | |
191 | Whether the ``idle_timeout`` and ``hard_timeout`` in the ``flow_mod``, | |
192 | respectively, have an effect on the flow or flows matched by the | |
193 | ``flow_mod``. | |
194 | ||
195 | "resets idle timer": | |
196 | Whether the ``flow_mod`` resets the per-flow timer that measures how long a | |
197 | flow has been idle. | |
198 | ||
199 | "resets hard timer": | |
200 | Whether the ``flow_mod`` resets the per-flow timer that measures how long it | |
201 | has been since a flow was modified. | |
202 | ||
203 | "zeros counters": | |
204 | Whether the ``flow_mod`` resets per-flow packet and byte counters to zero. | |
205 | ||
206 | "may add a new flow": | |
207 | Whether the ``flow_mod`` may add a new flow to the flow table. (Obviously | |
208 | this is always true for "add" commands but in some OpenFlow versions "modify" | |
209 | and "modify-strict" can also add new flows.) | |
210 | ||
211 | "sends ``flow_removed`` message": | |
212 | Whether the flow_mod generates a flow_removed message for the flow or flows | |
213 | that it affects. | |
214 | ||
215 | An entry labeled ``yes`` means that the flow mod type does have the indicated | |
216 | behavior, ``---`` means that it does not, an empty cell means that the property | |
217 | is not applicable, and other values are explained below the table. | |
218 | ||
219 | OpenFlow 1.0 | |
220 | ~~~~~~~~~~~~ | |
221 | ||
222 | ================================ === ====== ====== ====== ====== | |
223 | MODIFY DELETE | |
224 | RULE ADD MODIFY STRICT DELETE STRICT | |
225 | ================================ === ====== ====== ====== ====== | |
226 | match on ``priority`` yes --- yes --- yes | |
227 | match on ``out_port`` --- --- --- yes yes | |
228 | match on ``flow_cookie`` --- --- --- --- --- | |
229 | match on ``table_id`` --- --- --- --- --- | |
230 | controller chooses ``table_id`` --- --- --- | |
231 | updates ``flow_cookie`` yes yes yes | |
232 | updates ``OFPFF_SEND_FLOW_REM`` yes + + | |
233 | honors ``OFPFF_CHECK_OVERLAP`` yes + + | |
234 | updates ``idle_timeout`` yes + + | |
235 | updates ``hard_timeout`` yes + + | |
236 | resets idle timer yes + + | |
237 | resets hard timer yes yes yes | |
238 | zeros counters yes + + | |
239 | may add a new flow yes yes yes | |
240 | sends ``flow_removed`` message --- --- --- % % | |
241 | ================================ === ====== ====== ====== ====== | |
242 | ||
243 | where: | |
244 | ||
245 | ``+`` | |
246 | "modify" and "modify-strict" only take these actions when they create a new | |
247 | flow, not when they update an existing flow. | |
248 | ||
249 | ``%`` | |
250 | "delete" and "delete_strict" generate a flow_removed message if the deleted | |
251 | flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each controller | |
252 | can separately control whether it wants to receive the generated messages.) | |
253 | ||
254 | OpenFlow 1.1 | |
255 | ~~~~~~~~~~~~ | |
256 | ||
257 | OpenFlow 1.1 makes these changes: | |
258 | ||
259 | - The controller must now specify the ``table_id`` of the flow table that is | |
260 | searched and into which a flow may be inserted. Behavior for a ``table_id`` | |
261 | of 255 is undefined. | |
262 | ||
263 | - A ``flow_mod``, except an "add", can now match on the ``flow_cookie``. | |
264 | ||
265 | - When a ``flow_mod`` matches on the ``flow_cookie``, "modify" and | |
266 | "modify-strict" never insert a new flow. | |
267 | ||
268 | ================================ === ====== ====== ====== ====== | |
269 | MODIFY DELETE | |
270 | RULE ADD MODIFY STRICT DELETE STRICT | |
271 | ================================ === ====== ====== ====== ====== | |
272 | match on ``priority`` yes --- yes --- yes | |
273 | match on ``out_port`` --- --- --- yes yes | |
274 | match on ``flow_cookie`` --- yes yes yes yes | |
275 | match on ``table_id`` yes yes yes yes yes | |
276 | controller chooses ``table_id`` yes yes yes | |
277 | updates ``flow_cookie`` yes --- --- | |
278 | updates ``OFPFF_SEND_FLOW_REM`` yes + + | |
279 | honors ``OFPFF_CHECK_OVERLAP`` yes + + | |
280 | updates ``idle_timeout`` yes + + | |
281 | updates ``hard_timeout`` yes + + | |
282 | resets idle timer yes + + | |
283 | resets hard timer yes yes yes | |
284 | zeros counters yes + + | |
285 | may add a new flow yes # # | |
286 | sends ``flow_removed`` message --- --- --- % % | |
287 | ================================ === ====== ====== ====== ====== | |
288 | ||
289 | where: | |
290 | ||
291 | ``+`` | |
292 | "modify" and "modify-strict" only take these actions when they create a new | |
293 | flow, not when they update an existing flow. | |
294 | ||
295 | ``%`` | |
296 | "delete" and "delete_strict" generate a flow_removed message if the deleted | |
297 | flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each controller | |
298 | can separately control whether it wants to receive the generated messages.) | |
299 | ||
300 | ``#`` | |
301 | "modify" and "modify-strict" only add a new flow if the flow_mod does not | |
302 | match on any bits of the flow cookie. | |
303 | ||
304 | OpenFlow 1.2 | |
305 | ~~~~~~~~~~~~ | |
306 | ||
307 | OpenFlow 1.2 makes these changes: | |
308 | ||
309 | - Only "add" commands ever add flows; "modify" and "modify-strict" never do. | |
310 | ||
311 | - A new flag ``OFPFF_RESET_COUNTS`` now controls whether "modify" and | |
312 | "modify-strict" reset counters, whereas previously they never reset counters | |
313 | (except when they inserted a new flow). | |
314 | ||
315 | ================================ === ====== ====== ====== ====== | |
316 | MODIFY DELETE | |
317 | RULE ADD MODIFY STRICT DELETE STRICT | |
318 | ================================ === ====== ====== ====== ====== | |
319 | match on ``priority`` yes --- yes --- yes | |
320 | match on ``out_port`` --- --- --- yes yes | |
321 | match on ``flow_cookie`` --- yes yes yes yes | |
322 | match on ``table_id`` yes yes yes yes yes | |
323 | controller chooses ``table_id`` yes yes yes | |
324 | updates ``flow_cookie`` yes --- --- | |
325 | updates ``OFPFF_SEND_FLOW_REM`` yes --- --- | |
326 | honors ``OFPFF_CHECK_OVERLAP`` yes --- --- | |
327 | updates ``idle_timeout`` yes --- --- | |
328 | updates ``hard_timeout`` yes --- --- | |
329 | resets idle timer yes --- --- | |
330 | resets hard timer yes yes yes | |
331 | zeros counters yes & & | |
332 | may add a new flow yes --- --- | |
333 | sends ``flow_removed`` message --- --- --- % % | |
334 | ================================ === ====== ====== ====== ====== | |
335 | ||
336 | ``%`` | |
337 | "delete" and "delete_strict" generate a flow_removed message if the deleted | |
338 | flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set. (Each controller | |
339 | can separately control whether it wants to receive the generated messages.) | |
340 | ||
341 | ``&`` | |
342 | "modify" and "modify-strict" reset counters if the ``OFPFF_RESET_COUNTS`` | |
343 | flag is specified. | |
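
The change in counter handling for "modify" and "modify-strict" between versions can be summarized in a small predicate. This is a sketch under stated assumptions: the version enum and function name are hypothetical, and the flag value is the one given in the OpenFlow 1.2 and later headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OFPFF_RESET_COUNTS (1u << 2)  /* As in the OpenFlow 1.2+ headers. */

enum of_version { OF10, OF11, OF12, OF13 };  /* Hypothetical enum. */

/* Returns true if a "modify" or "modify-strict" that updates an EXISTING
 * flow zeros that flow's packet and byte counters. */
bool modify_zeros_counters(enum of_version version, uint16_t flags)
{
    assert(version <= OF13);
    if (version < OF12) {
        /* OF1.0 and OF1.1: counters are only zero for newly created flows
         * (the "+" entries in the earlier tables). */
        return false;
    }
    /* OF1.2 and later: controlled by the flag (the "&" entries). */
    return (flags & OFPFF_RESET_COUNTS) != 0;
}
```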
344 | ||
345 | OpenFlow 1.3 | |
346 | ~~~~~~~~~~~~ | |
347 | ||
348 | OpenFlow 1.3 makes these changes: | |
349 | ||
350 | - Behavior for a table_id of 255 is now defined, for "delete" and | |
351 | "delete-strict" commands, as meaning to delete from all tables. A table_id | |
352 | of 255 is now explicitly invalid for other commands. | |
353 | ||
354 | - New flags ``OFPFF_NO_PKT_COUNTS`` and ``OFPFF_NO_BYT_COUNTS`` for "add" | |
355 | operations. | |
356 | ||
357 | The table for 1.3 is the same as the one shown above for 1.2. | |
358 | ||
359 | OpenFlow 1.4 | |
360 | ~~~~~~~~~~~~ | |
361 | ||
362 | OpenFlow 1.4 makes these changes: | |
363 | ||
364 | - Adds the "importance" field to ``flow_mods``, but it does not explicitly | |
365 | specify which kinds of ``flow_mods`` set the importance. For consistency, | |
366 | Open vSwitch uses the same rule for importance as for ``idle_timeout`` and | |
367 | ``hard_timeout``, that is, only an "ADD" flow_mod sets the importance. (This | |
368 | issue has been filed with the ONF as EXT-496.) | |
369 | ||
370 | .. TODO(stephenfin) Link to EXT-496 | |
371 | ||
372 | - Adds an eviction mechanism that automatically deletes entries of lower | |
373 | importance to make space for newer entries. | |
374 | ||
375 | OpenFlow 1.4 Bundles | |
376 | -------------------- | |
377 | ||
378 | Open vSwitch makes all flow table modifications atomically, i.e., any datapath | |
379 | packet only sees flow table configurations either before or after any change | |
380 | made by any ``flow_mod``. For example, if a controller removes all flows with | |
381 | a single OpenFlow ``flow_mod``, no packet sees an intermediate version of the | |
382 | OpenFlow pipeline where only some of the flows have been deleted. | |
383 | ||
384 | It should be noted that Open vSwitch caches datapath flows, and that the cached | |
385 | flows are *NOT* flushed immediately when a flow table changes. Instead, the | |
386 | datapath flows are revalidated against the new flow table as soon as possible, | |
387 | and usually within one second of the modification. This design amortizes the | |
388 | cost of datapath cache flushing across multiple flow table changes, and has a | |
389 | significant performance effect during simultaneous heavy flow table churn and | |
390 | high traffic load. This means that different cached datapath flows may have | |
391 | been computed based on different flow table configurations, but each of the | |
392 | datapath flows is guaranteed to have been computed over a coherent view of the | |
393 | flow tables, as described above. | |
394 | ||
395 | With OpenFlow 1.4 bundles this atomicity can be extended across an arbitrary | |
396 | set of ``flow_mod`` messages. Bundles are supported for ``flow_mod`` and ``port_mod`` | |
397 | messages only. For ``flow_mod``, both ``atomic`` and ``ordered`` bundle flags | |
398 | are trivially supported, as all bundled messages are executed in the order they | |
399 | were added and all flow table modifications are now atomic to the datapath. | |
400 | Port mods may not appear in atomic bundles, as port status modifications are | |
401 | not atomic. | |
402 | ||
403 | To support bundles, ovs-ofctl has a ``--bundle`` option that makes the | |
404 | flow mod commands (``add-flow``, ``add-flows``, ``mod-flows``, ``del-flows``, | |
405 | and ``replace-flows``) use an OpenFlow 1.4 bundle to apply the | |
406 | modifications as a single atomic transaction. If any of the flow mods | |
407 | in a transaction fail, none of them are executed. All flow mods in a | |
408 | bundle appear to datapath lookups simultaneously. | |
409 | ||
410 | Furthermore, ovs-ofctl ``add-flow`` and ``add-flows`` commands now accept | |
411 | arbitrary flow mods as an input by allowing the flow specification to | |
412 | start with an explicit ``add``, ``modify``, ``modify_strict``, ``delete``, or | |
413 | ``delete_strict`` keyword. A missing keyword is treated as ``add``, so | |
414 | this is fully backwards compatible. With the new ``--bundle`` option | |
415 | all the flow mods are executed as a single atomic transaction using an | |
416 | OpenFlow 1.4 bundle. Without the ``--bundle`` option the flow mods are | |
417 | executed in order up to the first failing ``flow_mod``, and in case of an | |
418 | error the earlier successful ``flow_mod`` calls are not rolled back. | |
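
The difference between the two modes can be illustrated with a toy model. Everything below is hypothetical and greatly simplified: the "flow table" is a small fixed-size array, a "flow mod" is an insertion that fails when the table is full, and atomicity is modeled by staging changes on a copy. It only shows the all-or-nothing contract, not how Open vSwitch implements bundles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TABLE_CAPACITY 4

/* Toy flow table: illustration only. */
struct toy_table {
    int flows[TABLE_CAPACITY];
    size_t n;
};

/* Toy "flow mod": adds one flow and fails when the table is full. */
bool toy_add(struct toy_table *t, int flow)
{
    assert(t != NULL);
    if (t->n >= TABLE_CAPACITY) {
        return false;
    }
    t->flows[t->n++] = flow;
    return true;
}

/* Without a bundle: apply in order, stop at the first failure, and keep the
 * earlier changes.  Returns the number of flow mods applied. */
size_t apply_in_order(struct toy_table *t, const int *flows, size_t n)
{
    size_t i;
    for (i = 0; i < n && toy_add(t, flows[i]); i++) {
        /* toy_add() did the work. */
    }
    return i;
}

/* With a bundle: stage every change on a copy and commit only if all of
 * them succeed, so the table changes completely or not at all. */
bool apply_as_bundle(struct toy_table *t, const int *flows, size_t n)
{
    struct toy_table staged = *t;
    if (apply_in_order(&staged, flows, n) != n) {
        return false;           /* '*t' is untouched. */
    }
    *t = staged;
    return true;
}
```

With five insertions into an empty table of capacity four, ``apply_in_order`` leaves four flows installed, whereas ``apply_as_bundle`` fails and leaves the table empty.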
419 | ||
420 | ``OFPT_PACKET_IN`` | |
421 | ------------------ | |
422 | ||
423 | The OpenFlow 1.1 specification for ``OFPT_PACKET_IN`` is confusing. The | |
424 | definition in OF1.1 ``openflow.h`` is[*]: | |
425 | ||
426 | :: | |
427 | ||
428 | /* Packet received on port (datapath -> controller). */ | |
429 | struct ofp_packet_in { | |
430 | struct ofp_header header; | |
431 | uint32_t buffer_id; /* ID assigned by datapath. */ | |
432 | uint32_t in_port; /* Port on which frame was received. */ | |
433 | uint32_t in_phy_port; /* Physical Port on which frame was received. */ | |
434 | uint16_t total_len; /* Full length of frame. */ | |
435 | uint8_t reason; /* Reason packet is being sent (one of OFPR_*) */ | |
436 | uint8_t table_id; /* ID of the table that was looked up */ | |
437 | uint8_t data[0]; /* Ethernet frame, halfway through 32-bit word, | |
438 | so the IP header is 32-bit aligned. The | |
439 | amount of data is inferred from the length | |
440 | field in the header. Because of padding, | |
441 | offsetof(struct ofp_packet_in, data) == | |
442 | sizeof(struct ofp_packet_in) - 2. */ | |
443 | }; | |
444 | OFP_ASSERT(sizeof(struct ofp_packet_in) == 24); | |
445 | ||
446 | The confusing part is the comment on the ``data[]`` member. This comment is a | |
447 | leftover from OF1.0 ``openflow.h``, in which the comment was correct: | |
448 | ``sizeof(struct ofp_packet_in)`` is 20 in OF1.0 and ``offsetof(struct | |
449 | ofp_packet_in, data)`` is 18. When OF1.1 was written, the structure members | |
450 | were changed but the comment was carelessly not updated, and the comment became | |
451 | wrong: ``sizeof(struct ofp_packet_in)`` and ``offsetof(struct ofp_packet_in, | |
452 | data)`` are both 24 in OF1.1. | |
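
The sizes and offsets quoted above can be checked with a compiler. The sketch below assumes a typical ABI in which ``uint32_t`` is 4-byte aligned; the ``ofp10_``/``ofp11_`` struct names are hypothetical (both versions call the structure ``ofp_packet_in``), and ``data[]`` is written as a C99 flexible array member instead of ``data[0]``.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ofp_header {
    uint8_t version;
    uint8_t type;
    uint16_t length;
    uint32_t xid;
};

/* OF1.0 layout: the members end at byte 18 and the struct is padded to 20. */
struct ofp10_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint16_t total_len;
    uint16_t in_port;
    uint8_t reason;
    uint8_t pad;
    uint8_t data[];
};

/* OF1.1 layout, as quoted above: the members end exactly at byte 24. */
struct ofp11_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint32_t in_port;
    uint32_t in_phy_port;
    uint16_t total_len;
    uint8_t reason;
    uint8_t table_id;
    uint8_t data[];
};

/* The "sizeof - 2" comment holds for OF1.0 ... */
static_assert(sizeof(struct ofp10_packet_in) == 20, "OF1.0 size");
static_assert(offsetof(struct ofp10_packet_in, data) == 18, "OF1.0 offset");

/* ... but not for OF1.1, where there is no trailing padding at all. */
static_assert(sizeof(struct ofp11_packet_in) == 24, "OF1.1 size");
static_assert(offsetof(struct ofp11_packet_in, data) == 24, "OF1.1 offset");
```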
453 | ||
454 | That leaves the question of how to implement ``ofp_packet_in`` in OF1.1. The | |
455 | OpenFlow reference implementation for OF1.1 does not include any padding, that | |
456 | is, the first byte of the encapsulated frame immediately follows the | |
457 | ``table_id`` member without a gap. Open vSwitch therefore implements it the | |
458 | same way for compatibility. | |
459 | ||
460 | For an earlier discussion, please see the thread archived at: | |
461 | https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html | |
462 | ||
463 | [*] The quoted definition is directly from OF1.1. Definitions used inside OVS | |
464 | omit the 8-byte ``ofp_header`` members, so the sizes in this discussion are | |
465 | 8 bytes larger than those declared in OVS header files. | |
466 | ||
467 | VLAN Matching | |
468 | ------------- | |
469 | ||
470 | The 802.1Q VLAN header causes more trouble than any other 4 bytes in | |
471 | networking. More specifically, three versions of OpenFlow and Open vSwitch | |
472 | have among them four different ways to match the contents and presence of the | |
473 | VLAN header. The following table describes how each version works. | |
474 | ||
475 | ======== ============= =============== =============== ================ | |
476 | Match NXM OF1.0 OF1.1 OF1.2 | |
477 | ======== ============= =============== =============== ================ | |
478 | ``[1]`` ``0000/0000`` ``????/1,??/?`` ``????/1,??/?`` ``0000/0000,--`` | |
479 | ``[2]`` ``0000/ffff`` ``ffff/0,??/?`` ``ffff/0,??/?`` ``0000/ffff,--`` | |
480 | ``[3]`` ``1xxx/1fff`` ``0xxx/0,??/1`` ``0xxx/0,??/1`` ``1xxx/ffff,--`` | |
481 | ``[4]`` ``z000/f000`` ``????/1,0y/0`` ``fffe/0,0y/0`` ``1000/1000,0y`` | |
482 | ``[5]`` ``zxxx/ffff`` ``0xxx/0,0y/0`` ``0xxx/0,0y/0`` ``1xxx/ffff,0y`` | |
483 | ``[6]`` ``0000/0fff`` ``<none>`` ``<none>`` ``<none>`` | |
484 | ``[7]`` ``0000/f000`` ``<none>`` ``<none>`` ``<none>`` | |
485 | ``[8]`` ``0000/efff`` ``<none>`` ``<none>`` ``<none>`` | |
486 | ``[9]`` ``1001/1001`` ``<none>`` ``<none>`` ``1001/1001,--`` | |
487 | ``[10]`` ``3000/3000`` ``<none>`` ``<none>`` ``<none>`` | |
488 | ``[11]`` ``1000/1000`` ``<none>`` ``fffe/0,??/1`` ``1000/1000,--`` | |
489 | ======== ============= =============== =============== ================ | |
490 | ||
491 | where: | |
492 | ||
493 | Match: | |
494 | See the list below. | |
495 | ||
496 | NXM: | |
497 | ``xxxx/yyyy`` means ``NXM_OF_VLAN_TCI_W`` with value ``xxxx`` and mask | |
498 | ``yyyy``. A mask of ``0000`` is equivalent to omitting | |
499 | ``NXM_OF_VLAN_TCI(_W)``, a mask of ``ffff`` is equivalent to | |
500 | ``NXM_OF_VLAN_TCI``. | |
501 | ||
502 | OF1.0, OF1.1: | |
503 | ``wwww/x,yy/z`` means ``dl_vlan`` ``wwww``, ``OFPFW_DL_VLAN`` ``x``, | |
504 | ``dl_vlan_pcp`` ``yy``, and ``OFPFW_DL_VLAN_PCP`` ``z``. If | |
505 | ``OFPFW_DL_VLAN`` or ``OFPFW_DL_VLAN_PCP`` is 1, the corresponding field | |
506 | value is wildcarded, otherwise it is matched. ``?`` means that the given | |
507 | bits are ignored (their conventional values are ``0000/x,00/0`` in OF1.0, | |
508 | ``0000/x,00/1`` in OF1.1; ``x`` is never ignored). ``<none>`` means that the | |
509 | given match is not supported. | |
510 | ||
511 | OF1.2: | |
512 | ``xxxx/yyyy,zz`` means ``OXM_OF_VLAN_VID_W`` with value ``xxxx`` and mask | |
513 | ``yyyy``, and ``OXM_OF_VLAN_PCP`` (which is not maskable) with value ``zz``. | |
514 | A mask of ``0000`` is equivalent to omitting ``OXM_OF_VLAN_VID(_W)``, a mask | |
515 | of ``ffff`` is equivalent to ``OXM_OF_VLAN_VID``. ``--`` means that | |
516 | ``OXM_OF_VLAN_PCP`` is omitted. ``<none>`` means that the given match is not | |
517 | supported. | |
518 | ||
519 | The matches are: | |
520 | ||
521 | ``[1]``: | |
522 | Matches any packet, that is, one without an 802.1Q header or with an 802.1Q | |
523 | header with any TCI value. | |
524 | ||
525 | ``[2]`` | |
526 | Matches only packets without an 802.1Q header. | |
527 | ||
528 | NXM: | |
529 | Any match with ``vlan_tci == 0`` and ``(vlan_tci_mask & 0x1000) != 0`` is | |
530 | equivalent to the one listed in the table. | |
531 | ||
532 | OF1.0: | |
533 | The spec doesn't define behavior if ``dl_vlan`` is set to ``0xffff`` and | |
534 | ``OFPFW_DL_VLAN_PCP`` is not set. | |
535 | ||
536 | OF1.1: | |
537 | The spec says explicitly to ignore ``dl_vlan_pcp`` when ``dl_vlan`` is set | |
538 | to ``0xffff``. | |
539 | ||
540 | OF1.2: | |
541 | The spec doesn't say what should happen if ``vlan_vid == 0`` and | |
542 | ``(vlan_vid_mask & 0x1000) != 0`` but ``vlan_vid_mask != 0x1000``, but it | |
543 | would be straightforward to also interpret as ``[2]``. | |
544 | ||
545 | ``[3]`` | |
546 | Matches only packets that have an 802.1Q header with VID ``xxx`` (and any | |
547 | PCP). | |
548 | ||
549 | ``[4]`` | |
550 | Matches only packets that have an 802.1Q header with PCP ``y`` (and any VID). | |
551 | ||
552 | NXM: | |
553 | ``z`` is ``(y << 1) | 1``. | |
554 | ||
555 | OF1.0: | |
556 | The spec isn't very clear, but OVS implements it this way. | |
557 | ||
558 | OF1.2: | |
559 | Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1000`` | |
560 | would also work, but the spec doesn't define their behavior. | |
561 | ||
562 | ``[5]`` | |
563 | Matches only packets that have an 802.1Q header with VID ``xxx`` and PCP | |
564 | ``y``. | |
565 | ||
566 | NXM: | |
567 | ``z`` is ``((y << 1) | 1)``. | |
568 | ||
569 | OF1.2: | |
570 | Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1fff`` | |
571 | would also work. | |
572 | ||
573 | ``[6]`` | |
574 | Matches packets with no 802.1Q header or with an 802.1Q header with a VID of | |
575 | 0. Only possible with NXM. | |
576 | ||
577 | ``[7]`` | |
578 | Matches packets with no 802.1Q header or with an 802.1Q header with a PCP of | |
579 | 0. Only possible with NXM. | |
580 | ||
581 | ``[8]`` | |
582 | Matches packets with no 802.1Q header or with an 802.1Q header with both VID | |
583 | and PCP of 0. Only possible with NXM. | |
584 | ||
585 | ``[9]`` | |
586 | Matches only packets that have an 802.1Q header with an odd-numbered VID (and | |
587 | any PCP). Only possible with NXM and OF1.2. (This is just an example; one | |
588 | can match on any desired VID bit pattern.) | |
589 | ||
590 | ``[10]`` | |
591 | Matches only packets that have an 802.1Q header with an odd-numbered PCP (and | |
592 | any VID). Only possible with NXM. (This is just an example; one can match | |
593 | on any desired PCP bit pattern.) | |
594 | ||
595 | ``[11]`` | |
596 | Matches any packet with an 802.1Q header, regardless of VID or PCP. | |
597 | ||
598 | Additional notes: | |
599 | ||
600 | OF1.2: | |
601 | The top three bits of ``OXM_OF_VLAN_VID`` are fixed to zero, so bits 13, 14, | |
602 | and 15 in the masks listed in the table may be set to arbitrary values, as | |
603 | long as the corresponding value bits are also zero. The suggested ``ffff`` | |
604 | mask for [2], [3], and [5] allows a shorter OXM representation (the mask is | |
605 | omitted) than the minimal ``1fff`` mask. | |
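
For the NXM column, each match is a plain value/mask test on the 16-bit TCI. The sketch below builds the value and mask for matches ``[2]`` through ``[5]``. It assumes, as the table implies, that the TCI seen by NXM is zero for a packet without an 802.1Q header and has bit ``0x1000`` set whenever a header is present; all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VLAN_CFI 0x1000  /* Set in the TCI whenever an 802.1Q header exists. */

struct tci_match {
    uint16_t value;
    uint16_t mask;
};

/* The TCI as the NXM matches see it: 0 for an untagged packet. */
uint16_t nxm_tci(bool tagged, unsigned int pcp, unsigned int vid)
{
    assert(pcp <= 7 && vid <= 0xfff);
    return tagged ? (uint16_t) ((pcp << 13) | VLAN_CFI | vid) : 0;
}

bool tci_matches(struct tci_match m, uint16_t tci)
{
    return (tci & m.mask) == m.value;
}

/* [2]: 0000/ffff, only packets without an 802.1Q header. */
struct tci_match match_untagged(void)
{
    return (struct tci_match) { 0x0000, 0xffff };
}

/* [3]: 1xxx/1fff, a given VID and any PCP. */
struct tci_match match_vid(unsigned int vid)
{
    return (struct tci_match) { (uint16_t) (VLAN_CFI | vid), 0x1fff };
}

/* [4]: z000/f000 with z = (y << 1) | 1, a given PCP and any VID. */
struct tci_match match_pcp(unsigned int pcp)
{
    unsigned int z = (pcp << 1) | 1;
    return (struct tci_match) { (uint16_t) (z << 12), 0xf000 };
}

/* [5]: zxxx/ffff, a given VID and a given PCP. */
struct tci_match match_vid_pcp(unsigned int vid, unsigned int pcp)
{
    unsigned int z = (pcp << 1) | 1;
    return (struct tci_match) { (uint16_t) ((z << 12) | vid), 0xffff };
}
```

For example, a packet tagged with VID 100 and PCP 5 has TCI ``0xb064`` in this model, which satisfies ``match_vid(100)``, ``match_pcp(5)``, and ``match_vid_pcp(100, 5)`` but not ``match_untagged()``.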
606 | ||
607 | Flow Cookies | |
608 | ------------ | |
609 | ||
610 | OpenFlow 1.0 and later versions have the concept of a "flow cookie", which is a | |
611 | 64-bit integer value attached to each flow. The treatment of the flow cookie | |
612 | has varied greatly across OpenFlow versions, however. | |
613 | ||
614 | In OpenFlow 1.0: | |
615 | ||
616 | - ``OFPFC_ADD`` set the cookie in the flow that it added. | |
617 | ||
618 | - ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` updated the cookie for the flow | |
619 | or flows that it modified. | |
620 | ||
621 | - ``OFPST_FLOW`` messages included the flow cookie. | |
622 | ||
623 | - ``OFPT_FLOW_REMOVED`` messages reported the cookie of the flow that was | |
624 | removed. | |
625 | ||
626 | OpenFlow 1.1 made the following changes: | |
627 | ||
628 | - Flow mod operations ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``, | |
629 | ``OFPFC_DELETE``, and ``OFPFC_DELETE_STRICT``, plus flow stats requests and | |
630 | aggregate stats requests, gained the ability to match on flow cookies with an | |
631 | arbitrary mask. | |
632 | ||
633 | - ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to add a new flow, | |
634 | in the case of no match, only if the flow table modification operation did | |
635 | not match on the cookie field. (In OpenFlow 1.0, modify operations always | |
636 | added a new flow when there was no match.) | |
637 | ||
638 | - ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` no longer updated flow cookies. | |
639 | ||
640 | OpenFlow 1.2 made the following changes: | |
641 | ||
642 | - ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to never add a new | |
643 | flow, regardless of whether the flow cookie was used for matching. | |
644 | ||
645 | Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0 behavior with | |
646 | the following extensions: | |
647 | ||
648 | - An NXM extension field ``NXM_NX_COOKIE(_W)`` allows the NXM versions of | |
649 | ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``, ``OFPFC_DELETE``, and | |
650 | ``OFPFC_DELETE_STRICT`` ``flow_mod`` calls, plus flow stats requests and | |
651 | aggregate stats requests, to match on flow cookies with arbitrary masks. | |
652 | This is much like the equivalent OpenFlow 1.1 feature. | |
653 | ||
654 | - Like OpenFlow 1.1, ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` add a new flow | |
655 | if there is no match and the mask is zero (or not given). | |
656 | ||
657 | - The ``cookie`` field in ``OFPT_FLOW_MOD`` and ``NXT_FLOW_MOD`` messages is | |
658 | used as the cookie value for ``OFPFC_ADD`` commands, as described in OpenFlow | |
659 | 1.0. For ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` commands, the | |
660 | ``cookie`` field is used as a new cookie for flows that match unless it is | |
661 | ``UINT64_MAX``, in which case the flow's cookie is not updated. | |
662 | ||
663 | - ``NXT_PACKET_IN`` (the Nicira extended version of ``OFPT_PACKET_IN``) reports | |
664 | the cookie of the rule that generated the packet, or all-1-bits if no rule | |
665 | generated the packet. (Older versions of OVS used all-0-bits instead of | |
666 | all-1-bits.) | |
667 | ||
668 | The following table shows the handling of different protocols when receiving | |
669 | ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` messages. A mask of 0 indicates | |
670 | either an explicit mask of zero or an implicit one by not specifying the | |
671 | ``NXM_NX_COOKIE(_W)`` field. | |
672 | ||
673 | ============== ====== ====== ============= ============= | |
674 | Match Update Add on miss Add on miss | |
675 | cookie cookie mask!=0 mask==0 | |
676 | ============== ====== ====== ============= ============= | |
677 | OpenFlow 1.0 no yes (add on miss) (add on miss) | |
678 | OpenFlow 1.1 yes no no yes | |
679 | OpenFlow 1.2 yes no no no | |
680 | NXM yes yes\* no yes | |
681 | ============== ====== ====== ============= ============= | |
682 | ||
683 | \* Updates the flow's cookie unless the ``cookie`` field is ``UINT64_MAX``. | |
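
The table above can be read as a small decision function. The following is a minimal sketch that only transcribes the table rows; the function name ``modify_behavior`` and the protocol strings are illustrative assumptions, not Open vSwitch identifiers.

```python
UINT64_MAX = 2**64 - 1


def modify_behavior(protocol, cookie_mask=0, cookie=0):
    """Return (adds_on_miss, updates_cookie) for OFPFC_MODIFY(_STRICT).

    Transcribes the table above. cookie_mask == 0 covers both an explicit
    zero mask and an omitted NXM_NX_COOKIE(_W) field.
    """
    if protocol == "OpenFlow 1.0":
        # No cookie matching is possible, so a miss always adds a flow.
        return True, True
    if protocol == "OpenFlow 1.1":
        return cookie_mask == 0, False
    if protocol == "OpenFlow 1.2":
        return False, False
    if protocol == "NXM":
        # Footnote (*): the cookie is left alone when it is UINT64_MAX.
        return cookie_mask == 0, cookie != UINT64_MAX
    raise ValueError("unknown protocol: %r" % (protocol,))
```

For example, an NXM modify with a nonzero cookie mask never adds a flow on a miss, while the same request with no mask does.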
684 | ||
685 | Multiple Table Support | |
686 | ---------------------- | |
687 | ||
688 | OpenFlow 1.0 has only rudimentary support for multiple flow tables. Notably, | |
689 | OpenFlow 1.0 does not allow the controller to specify the flow table to which a | |
690 | flow is to be added. Open vSwitch adds an extension for this purpose, which is | |
691 | enabled on a per-OpenFlow connection basis using the ``NXT_FLOW_MOD_TABLE_ID`` | |
692 | message. When the extension is enabled, the upper 8 bits of the ``command`` | |
693 | member in an ``OFPT_FLOW_MOD`` or ``NXT_FLOW_MOD`` message designates the table | |
694 | to which a flow is to be added. | |
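
As a rough illustration of the encoding described above, the sketch below packs a table number into the upper 8 bits of the 16-bit ``command`` member. It assumes that the flow-mod command itself occupies the lower 8 bits; the helper names ``pack_command`` and ``unpack_command`` are hypothetical.

```python
# OpenFlow 1.0 flow-mod commands.
OFPFC_ADD = 0
OFPFC_MODIFY = 1
OFPFC_MODIFY_STRICT = 2
OFPFC_DELETE = 3
OFPFC_DELETE_STRICT = 4


def pack_command(table_id, command):
    """Combine a table number and a flow-mod command into one 16-bit field."""
    if not 0 <= table_id <= 0xFF:
        raise ValueError("table_id must fit in 8 bits")
    if not 0 <= command <= 0xFF:
        raise ValueError("command must fit in 8 bits")
    return (table_id << 8) | command


def unpack_command(field):
    """Split a 16-bit command field into (table_id, command)."""
    return (field >> 8) & 0xFF, field & 0xFF
```

With the extension disabled, the upper 8 bits are zero, so ``unpack_command`` yields table 0 for an ordinary OpenFlow 1.0 command value.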
695 | ||
696 | The Open vSwitch software switch implementation offers 255 flow tables. On | |
697 | packet ingress, only the first flow table (table 0) is searched, and the | |
698 | contents of the remaining tables are not considered in any way. Tables other | |
699 | than table 0 only come into play when an ``NXAST_RESUBMIT_TABLE`` action | |
700 | specifies another table to search. | |
701 | ||
702 | Tables 128 and above are reserved for use by the switch itself. Controllers | |
703 | should use only tables 0 through 127. | |
704 | ||
705 | ``OFPTC_*`` Table Configuration | |
706 | ------------------------------- | |
707 | ||
708 | This section covers the history of the ``OFPTC_*`` table configuration bits | |
709 | across OpenFlow versions. | |
710 | ||
711 | OpenFlow 1.0 flow tables had fixed configurations. | |
712 | ||
713 | OpenFlow 1.1 enabled controllers to configure behavior upon flow table miss and | |
714 | added the ``OFPTC_MISS_*`` constants for that purpose. ``OFPTC_*`` did not | |
715 | control anything else but it was nevertheless conceptualized as a set of | |
716 | bit-fields instead of an enum. OF1.1 added the ``OFPT_TABLE_MOD`` message to | |
717 | set ``OFPTC_MISS_*`` for a flow table and added the ``config`` field to the | |
718 | ``OFPST_TABLE`` reply to report the current setting. | |
719 | ||
720 | OpenFlow 1.2 did not change anything in this regard. | |
721 | ||
722 | OpenFlow 1.3 switched to another means of changing flow table miss behavior and | |
723 | deprecated ``OFPTC_MISS_*`` without adding any more ``OFPTC_*`` constants. | |
724 | This meant that ``OFPT_TABLE_MOD`` now had no purpose at all, but OF1.3 kept it | |
725 | around "for backward compatibility with older and newer versions of the | |
726 | specification." At the same time, OF1.3 introduced a new message | |
727 | ``OFPMP_TABLE_FEATURES`` that included a field ``config`` documented as reporting | |
728 | the ``OFPTC_*`` values set with ``OFPT_TABLE_MOD``; of course this served no | |
729 | real purpose because no ``OFPTC_*`` values are defined. OF1.3 did remove the | |
730 | ``OFPTC_*`` field from ``OFPMP_TABLE`` (previously named ``OFPST_TABLE``). | |
731 | ||
732 | OpenFlow 1.4 defined two new ``OFPTC_*`` constants, ``OFPTC_EVICTION`` and | |
733 | ``OFPTC_VACANCY_EVENTS``, using bits that did not overlap with ``OFPTC_MISS_*`` | |
734 | even though those bits had not been defined since OF1.2. ``OFPT_TABLE_MOD`` | |
735 | still controlled these settings. The field for ``OFPTC_*`` values in | |
736 | ``OFPMP_TABLE_FEATURES`` was renamed from ``config`` to ``capabilities`` and | |
737 | documented as reporting the flags that are supported in an ``OFPT_TABLE_MOD`` | |
738 | message. The ``OFPMP_TABLE_DESC`` message newly added in OF1.4 reported the | |
739 | ``OFPTC_*`` setting. | |
740 | ||
741 | OpenFlow 1.5 did not change anything in this regard. | |
742 | ||
743 | .. list-table:: Revisions | |
744 | :header-rows: 1 | |
745 | ||
746 | * - OpenFlow | |
747 | - ``OFPTC_*`` flags | |
748 | - ``TABLE_MOD`` | |
749 | - Statistics | |
750 | - ``TABLE_FEATURES`` | |
751 | - ``TABLE_DESC`` | |
752 | * - OF1.0 | |
753 | - none | |
754 | - no (\*)(+) | |
755 | - no (\*) | |
756 | - nothing (\*)(+) | |
757 | - no (\*)(+) | |
758 | * - OF1.1/1.2 | |
759 | - ``MISS_*`` | |
760 | - yes | |
761 | - yes | |
762 | - nothing (+) | |
763 | - no (+) | |
764 | * - OF1.3 | |
765 | - none | |
766 | - yes (\*) | |
767 | - no (\*) | |
768 | - config (\*) | |
769 | - no (\*)(+) | |
770 | * - OF1.4/1.5 | |
771 | - ``EVICTION``/``VACANCY_EVENTS`` | |
772 | - yes | |
773 | - no | |
774 | - capabilities | |
775 | - yes | |
776 | ||
777 | where: | |
778 | ||
779 | OpenFlow: | |
780 | The OpenFlow version(s). | |
781 | ||
782 | ``OFPTC_*`` flags: | |
783 | The ``OFPTC_*`` flags defined in those versions. | |
784 | ||
785 | ``TABLE_MOD``: | |
786 | Whether ``OFPT_TABLE_MOD`` can modify ``OFPTC_*`` flags. | |
787 | ||
788 | Statistics: | |
789 | Whether ``OFPST_TABLE/OFPMP_TABLE`` reports the ``OFPTC_*`` flags. | |
790 | ||
791 | ``TABLE_FEATURES``: | |
792 | What ``OFPMP_TABLE_FEATURES`` reports (if it exists): either the current | |
793 | configuration or the switch's capabilities. | |
794 | ||
795 | ``TABLE_DESC``: | |
796 | Whether ``OFPMP_TABLE_DESC`` reports the current configuration. | |
797 | ||
798 | (\*): Nothing to report/change anyway. | |
799 | ||
800 | (+): No such message. | |
801 | ||
802 | IPv6 | |
803 | ---- | |
804 | ||
805 | Open vSwitch supports stateless handling of IPv6 packets. Flows can be written | |
806 | to support matching TCP, UDP, and ICMPv6 headers within an IPv6 packet. Deeper | |
807 | matching of some Neighbor Discovery messages is also supported. | |
808 | ||
809 | IPv6 was not designed to interact well with middle-boxes. This, combined with | |
810 | Open vSwitch's stateless nature, has affected the processing of IPv6 traffic, | |
811 | which is detailed below. | |
812 | ||
813 | Extension Headers | |
814 | ~~~~~~~~~~~~~~~~~ | |
815 | ||
816 | The base IPv6 header is incredibly simple with the intention of only containing | |
817 | information relevant for routing packets between two endpoints. IPv6 relies | |
818 | heavily on the use of extension headers to provide any other functionality. | |
819 | Unfortunately, the extension headers were designed in such a way that it is | |
820 | impossible to move to the next header (including the layer-4 payload) unless | |
821 | the current header is understood. | |
822 | ||
823 | Open vSwitch will process the following extension headers and continue to the | |
824 | next header: | |
825 | ||
826 | - Fragment (see the next section) | |
827 | - AH (Authentication Header) | |
828 | - Hop-by-Hop Options | |
829 | - Routing | |
830 | - Destination Options | |
831 | ||
832 | When a header is encountered that is not in that list, it is considered | |
833 | "terminal". A terminal header's IPv6 protocol value is stored in ``nw_proto`` | |
834 | for matching purposes. If a terminal header is TCP, UDP, or ICMPv6, the packet | |
835 | will be further processed in an attempt to extract layer-4 information. | |
836 | ||
837 | Fragments | |
838 | ~~~~~~~~~ | |
839 | ||
840 | IPv6 requires that every link in the internet have an MTU of 1280 octets or | |
841 | greater (RFC 2460). As such, a terminal header (as described above in | |
842 | "Extension Headers") in the first fragment should generally be reachable. In | |
843 | this case, the terminal header's IPv6 protocol type is stored in the | |
844 | ``nw_proto`` field for matching purposes. If a terminal header cannot be found | |
845 | in the first fragment (one with a fragment offset of zero), the ``nw_proto`` | |
846 | field is set to 0. Subsequent fragments (those with a non-zero fragment | |
847 | offset) have the ``nw_proto`` field set to the IPv6 protocol type for fragments | |
848 | (44). | |
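
The header walk described in "Extension Headers" and "Fragments" can be sketched as follows. This is a simplified model, not the Open vSwitch parser: the function name ``find_nw_proto`` is hypothetical, it does not check that the terminal header itself is present, and returning 0 for a truncated header in an unfragmented packet is an assumption (the text only specifies that case for first fragments).

```python
# IPv6 Next Header values for the extension headers that are skipped.
HOPOPTS, ROUTING, FRAGMENT, AH, DSTOPTS = 0, 43, 44, 51, 60
SKIPPED = (HOPOPTS, ROUTING, FRAGMENT, AH, DSTOPTS)


def find_nw_proto(first_next_header, payload):
    """Return the nw_proto value for an IPv6 packet.

    first_next_header: Next Header value from the 40-byte base header.
    payload: the bytes that follow the base header.
    """
    nh = first_next_header
    off = 0
    while nh in SKIPPED:
        if len(payload) < off + 8:
            return 0  # Terminal header not reachable.
        next_nh = payload[off]
        if nh == FRAGMENT:
            # The fragment offset is the top 13 bits of bytes 2 and 3.
            frag_off = int.from_bytes(payload[off + 2:off + 4], "big") >> 3
            if frag_off != 0:
                return FRAGMENT  # Later fragment: report protocol 44.
            length = 8
        elif nh == AH:
            length = (payload[off + 1] + 2) * 4  # AH counts 4-byte words.
        else:
            length = (payload[off + 1] + 1) * 8  # Others count 8-byte units.
        nh = next_nh
        off += length
    return nh
```

A first fragment that carries a reachable UDP header therefore yields 17, while every later fragment of the same packet yields 44.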
849 | ||
850 | Jumbograms | |
851 | ~~~~~~~~~~ | |
852 | ||
853 | An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer than | |
854 | 65,535 octets. Jumbograms are only relevant in subnets with a link MTU greater | |
855 | than 65,575 octets, and are not required to be supported on nodes that do not | |
856 | connect to links with such large MTUs. Currently, Open vSwitch does not process | |
857 | jumbograms. | |
858 | ||
859 | In-Band Control | |
860 | --------------- | |
861 | ||
862 | Motivation | |
863 | ~~~~~~~~~~ | |
864 | ||
865 | An OpenFlow switch must establish and maintain a TCP network connection to its | |
866 | controller. There are two basic ways to categorize the network that this | |
867 | connection traverses: either it is completely separate from the one that the | |
868 | switch is otherwise controlling, or its path may overlap the network that the | |
869 | switch controls. We call the former case "out-of-band control", the latter | |
870 | case "in-band control". | |
871 | ||
872 | Out-of-band control has the following benefits: | |
873 | ||
874 | - Simplicity: Out-of-band control slightly simplifies the switch | |
875 | implementation. | |
876 | ||
877 | - Reliability: Excessive switch traffic volume cannot interfere with control | |
878 | traffic. | |
879 | ||
880 | - Integrity: Machines not on the control network cannot impersonate a switch or | |
881 | a controller. | |
882 | ||
883 | - Confidentiality: Machines not on the control network cannot snoop on control | |
884 | traffic. | |
885 | ||
886 | In-band control, on the other hand, has the following advantages: | |
887 | ||
888 | - No dedicated port: There is no need to dedicate a physical switch port to | |
889 | control, which is important on switches that have few ports (e.g. wireless | |
890 | routers, low-end embedded platforms). | |
891 | ||
892 | - No dedicated network: There is no need to build and maintain a separate | |
893 | control network. This is important in many environments because it reduces | |
894 | proliferation of switches and wiring. | |
895 | ||
896 | Open vSwitch supports both out-of-band and in-band control. This section | |
897 | describes the principles behind in-band control. See the description of the | |
898 | Controller table in ovs-vswitchd.conf.db(5) to configure OVS for in-band | |
899 | control. | |
900 | ||
901 | Principles | |
902 | ~~~~~~~~~~ | |
903 | ||
904 | The fundamental principle of in-band control is that an OpenFlow switch must | |
905 | recognize and switch control traffic without involving the OpenFlow controller. | |
906 | All the details of implementing in-band control are special cases of this | |
907 | principle. | |
908 | ||
909 | The rationale for this principle is simple. If the switch does not handle | |
910 | in-band control traffic itself, then it will be caught in a contradiction: it | |
911 | must contact the controller, but it cannot, because only the controller can set | |
912 | up the flows that are needed to contact the controller. | |
913 | ||
914 | The following points describe important special cases of this principle. | |
915 | ||
916 | - In-band control must be implemented regardless of whether the switch is | |
917 | connected. | |
918 | ||
919 | It is tempting to implement the in-band control rules only when the switch is | |
920 | not connected to the controller, using the reasoning that the controller | |
921 | should have complete control once it has established a connection with the | |
922 | switch. | |
923 | ||
924 | This does not work in practice. Consider the case where the switch is | |
925 | connected to the controller. Occasionally it can happen that the controller | |
926 | forgets or otherwise needs to obtain the MAC address of the switch. To do | |
927 | so, the controller sends a broadcast ARP request. A switch that implements | |
928 | the in-band control rules only when it is disconnected will then send an | |
929 | ``OFPT_PACKET_IN`` message up to the controller. The controller will be | |
930 | unable to respond, because it does not know the MAC address of the switch. | |
931 | This is a deadlock situation that can only be resolved by the switch noticing | |
932 | that its connection to the controller has hung and reconnecting. | |
933 | ||
934 | - In-band control must override flows set up by the controller. | |
935 | ||
936 | It is reasonable to assume that flows set up by the OpenFlow controller | |
937 | should take precedence over in-band control, on the basis that the controller | |
938 | should be in charge of the switch. | |
939 | ||
940 | Again, this does not work in practice. Reasonable controller implementations | |
941 | may set up a "last resort" fallback rule that wildcards every field and, | |
942 | e.g., sends it up to the controller or discards it. If a controller does | |
943 | that, then it will isolate itself from the switch. | |
944 | ||
945 | - The switch must recognize all control traffic. | |
946 | ||
947 | The fundamental principle of in-band control states, in part, that a switch | |
948 | must recognize control traffic without involving the OpenFlow controller. | |
949 | More specifically, the switch must recognize *all* control traffic. "False | |
950 | negatives", that is, packets that constitute control traffic but that the | |
951 | switch does not recognize as control traffic, lead to control traffic storms. | |
952 | ||
953 | Consider an OpenFlow switch that only recognizes control packets sent to or | |
954 | from that switch. Now suppose that two switches of this type, named A and B, | |
955 | are connected to ports on an Ethernet hub (not a switch) and that an OpenFlow | |
956 | controller is connected to a third hub port. In this setup, control traffic | |
957 | sent by switch A will be seen by switch B, which will send it to the | |
958 | controller as part of an ``OFPT_PACKET_IN`` message. Switch A will then see | |
959 | the ``OFPT_PACKET_IN`` message's packet, re-encapsulate it in another | |
960 | ``OFPT_PACKET_IN``, and send it to the controller. Switch B will then see | |
961 | that ``OFPT_PACKET_IN``, and so on in an infinite loop. | |
962 | ||
963 | Incidentally, the consequences of "false positives", where packets that are | |
964 | not control traffic are nevertheless recognized as control traffic, are much | |
965 | less severe. The controller will not be able to control their behavior, but | |
966 | the network will remain in working order. False positives do constitute a | |
967 | security problem. | |
968 | ||
969 | - The switch should use echo-requests to detect disconnection. | |
970 | ||
971 | TCP will notice that a connection has hung, but this can take a considerable | |
972 | amount of time. For example, with default settings the Linux kernel TCP | |
973 | implementation will retransmit for between 13 and 30 minutes, depending on | |
974 | the connection's retransmission timeout, according to kernel documentation. | |
975 | This is far too long for a switch to be disconnected, so an OpenFlow switch | |
976 | should implement its own connection timeout. OpenFlow ``OFPT_ECHO_REQUEST`` | |
977 | messages are the best way to do this, since they test the OpenFlow connection | |
978 | itself. | |
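
One way to picture the echo-based timeout is as a small state machine driven by a clock. The sketch below is an illustration under stated assumptions, not the Open vSwitch implementation: the class name ``EchoProbe``, the single ``probe_interval`` parameter, and the policy of waiting one further interval for a reply are all hypothetical.

```python
class EchoProbe:
    """Decide when to send an echo request and when to give up."""

    def __init__(self, probe_interval):
        self.probe_interval = probe_interval
        self.last_rx = 0.0          # Time of the last message received.
        self.echo_sent_at = None    # Time the outstanding echo was sent.

    def on_receive(self, now):
        """Any received message proves the connection is alive."""
        self.last_rx = now
        self.echo_sent_at = None

    def tick(self, now):
        """Return 'ok', 'send_echo', or 'disconnect'."""
        if self.echo_sent_at is not None:
            if now - self.echo_sent_at >= self.probe_interval:
                return "disconnect"  # No reply to the echo request.
            return "ok"
        if now - self.last_rx >= self.probe_interval:
            self.echo_sent_at = now
            return "send_echo"       # Idle too long: probe the peer.
        return "ok"
```

With a probe interval of a few seconds, a hung connection is noticed after roughly two intervals instead of the many minutes that TCP retransmission alone can take.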
979 | ||
980 | Implementation | |
981 | ~~~~~~~~~~~~~~ | |
982 | ||
983 | This section describes how Open vSwitch implements in-band control. Correctly | |
984 | implementing in-band control has proven difficult due to its many subtleties, | |
985 | and has thus gone through many iterations. Please read through and understand | |
986 | the reasoning behind the chosen rules before making modifications. | |
987 | ||
988 | Open vSwitch implements in-band control as "hidden" flows, that is, flows that | |
989 | are not visible through OpenFlow, and at a higher priority than wildcarded | |
990 | flows can be set up through OpenFlow. This is done so that the OpenFlow | |
991 | controller cannot interfere with them and possibly break connectivity with its | |
992 | switches. It is possible to see all flows, including in-band ones, with the | |
993 | ovs-appctl "bridge/dump-flows" command. | |
994 | ||
995 | The Open vSwitch implementation of in-band control can hide traffic to | |
996 | arbitrary "remotes", where each remote is one TCP port on one IP address. | |
997 | Currently the remotes are automatically configured as the in-band OpenFlow | |
998 | controllers plus the OVSDB managers, if any. (The latter is a requirement | |
999 | because OVSDB managers are responsible for configuring OpenFlow controllers, so | |
1000 | if the manager cannot be reached then OpenFlow cannot be reconfigured.) | |
1001 | ||
1002 | The following rules (with the ``OFPP_NORMAL`` action) are set up on any bridge that | |
1003 | has any remotes: | |
1004 | ||
1005 | (a) | |
1006 | DHCP requests sent from the local port. | |
1007 | (b) | |
1008 | ARP replies to the local port's MAC address. | |
1009 | (c) | |
1010 | ARP requests from the local port's MAC address. | |
1011 | ||
1012 | In-band also sets up the following rules for each unique next-hop MAC address | |
1013 | for the remotes' IPs (the "next hop" is either the remote itself, if it is on a | |
1014 | local subnet, or the gateway to reach the remote): | |
1015 | ||
1016 | (d) | |
1017 | ARP replies to the next hop's MAC address. | |
1018 | (e) | |
1019 | ARP requests from the next hop's MAC address. | |
1020 | ||
1021 | In-band also sets up the following rules for each unique remote IP address: | |
1022 | ||
1023 | (f) | |
1024 | ARP replies containing the remote's IP address as a target. | |
1025 | (g) | |
1026 | ARP requests containing the remote's IP address as a source. | |
1027 | ||
1028 | In-band also sets up the following rules for each unique remote (IP,port) pair: | |
1029 | ||
1030 | (h) | |
1031 | TCP traffic to the remote's IP and port. | |
1032 | (i) | |
1033 | TCP traffic from the remote's IP and port. | |
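
The way rules (a) through (i) scale with the configuration can be sketched by enumerating them. This is only an illustration: the function ``in_band_rules`` and the match keys (``eth_src``, ``arp_tpa``, and so on) are hypothetical names chosen for readability, not Open vSwitch identifiers, and the sketch says nothing about rule priorities.

```python
ARP_REQUEST, ARP_REPLY = 1, 2
DHCP_CLIENT_PORT, DHCP_SERVER_PORT = 68, 67


def in_band_rules(local_mac, next_hop_macs, remotes):
    """Return (label, match) pairs for rules (a) through (i).

    remotes is an iterable of (ip, tcp_port) pairs.
    """
    remotes = sorted(set(remotes))
    if not remotes:
        return []  # Rules are set up only on bridges that have remotes.
    rules = [
        ("a", {"from_local_port": True, "udp_src": DHCP_CLIENT_PORT,
               "udp_dst": DHCP_SERVER_PORT}),
        ("b", {"arp_op": ARP_REPLY, "eth_dst": local_mac}),
        ("c", {"arp_op": ARP_REQUEST, "eth_src": local_mac}),
    ]
    for mac in sorted(set(next_hop_macs)):       # One pair per next hop.
        rules.append(("d", {"arp_op": ARP_REPLY, "eth_dst": mac}))
        rules.append(("e", {"arp_op": ARP_REQUEST, "eth_src": mac}))
    for ip in sorted({ip for ip, _ in remotes}):  # One pair per remote IP.
        rules.append(("f", {"arp_op": ARP_REPLY, "arp_tpa": ip}))
        rules.append(("g", {"arp_op": ARP_REQUEST, "arp_spa": ip}))
    for ip, port in remotes:                      # One pair per (IP, port).
        rules.append(("h", {"tcp": True, "ip_dst": ip, "tcp_dst": port}))
        rules.append(("i", {"tcp": True, "ip_src": ip, "tcp_src": port}))
    return rules
```

Two remotes that share an IP address and a next hop but use different TCP ports produce eleven rules: three for the local port, two for the next hop, two for the IP address, and four for the two (IP, port) pairs.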
1034 | ||
1035 | The goal of these rules is to be as narrow as possible to allow a switch to | |
1036 | join a network and be able to communicate with the remotes. As mentioned | |
1037 | earlier, these rules have higher priority than the controller's rules, so if | |
1038 | they are too broad, they may prevent the controller from implementing its | |
1039 | policy. As such, in-band actively monitors some aspects of flow and packet | |
1040 | processing so that the rules can be made more precise. | |
1041 | ||
1042 | In-band control monitors attempts to add flows into the datapath that could | |
1043 | interfere with its duties. The datapath only allows exact match entries, so | |
1044 | in-band control is able to be very precise about the flows it prevents. Flows | |
1045 | that miss in the datapath are sent to userspace to be processed, so preventing | |
1046 | these flows from being cached in the "fast path" does not affect correctness. | |
1047 | The only type of flow that is currently prevented is one that would prevent | |
1048 | DHCP replies from being seen by the local port. For example, a rule that | |
1049 | forwarded all DHCP traffic to the controller would not be allowed, but one that | |
1050 | forwarded to all ports (including the local port) would. | |
1051 | ||
1052 | As mentioned earlier, packets that miss in the datapath are sent to the | |
1053 | userspace for processing. The userspace has its own flow table, the | |
1054 | "classifier", so in-band checks whether any special processing is needed before | |
1055 | the classifier is consulted. If a packet is a DHCP response to a request from | |
1056 | the local port, the packet is forwarded to the local port, regardless of the | |
1057 | flow table. Note that this requires L7 processing of DHCP replies to determine | |
1058 | whether the 'chaddr' field matches the MAC address of the local port. | |
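
The 'chaddr' comparison mentioned above can be sketched against the fixed BOOTP header layout, in which 'chaddr' starts at byte offset 28. The function name ``is_dhcp_reply_for`` is hypothetical, and the sketch assumes the caller has already stripped the Ethernet, IP, and UDP headers.

```python
BOOTREPLY = 2          # BOOTP 'op' value for replies.
HTYPE_ETHERNET = 1     # Hardware type for Ethernet.
CHADDR_OFFSET = 28     # op, htype, hlen, hops, xid, secs, flags, 4 addresses.
CHADDR_LEN = 16


def is_dhcp_reply_for(local_mac, bootp):
    """Return True if bootp is a DHCP reply addressed to local_mac.

    local_mac: 6 bytes. bootp: the UDP payload of the packet.
    """
    if len(bootp) < CHADDR_OFFSET + CHADDR_LEN:
        return False  # Too short to contain the 'chaddr' field.
    op, htype, hlen = bootp[0], bootp[1], bootp[2]
    if op != BOOTREPLY or htype != HTYPE_ETHERNET or hlen != 6:
        return False
    return bootp[CHADDR_OFFSET:CHADDR_OFFSET + 6] == local_mac
```

A check of this kind is what makes the processing "L7": the decision depends on a field inside the DHCP payload rather than on any L2 to L4 header.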
1059 | ||
1060 | It is interesting to note that for an L3-based in-band control mechanism, the | |
1061 | majority of rules are devoted to ARP traffic. At first glance, some of these | |
1062 | rules appear redundant. However, each serves an important role. First, in | |
1063 | order to determine the MAC address of the remote side (controller or gateway) | |
1064 | for other ARP rules, we must allow ARP traffic for our local port with rules | |
1065 | (b) and (c). If we are between a switch and its connection to the remote, we | |
1066 | have to allow the other switch's ARP traffic through. This is done with | |
1067 | rules (d) and (e), since we do not know the addresses of the other switches a | |
1068 | priori, but do know the remote's or gateway's. Finally, if the remote is | |
1069 | running in a local guest VM that is not reached through the local port, the | |
1070 | switch that is connected to the VM must allow ARP traffic based on the remote's | |
1071 | IP address, since it will not know the MAC address of the local port that is | |
1072 | sending the traffic or the MAC address of the remote in the guest VM. | |
1073 | ||
1074 | With a few notable exceptions below, in-band should work in most network | |
1075 | setups. The following are considered "supported" in the current | |
1076 | implementation: | |
1077 | ||
1078 | - Locally Connected. The switch and remote are on the same subnet. This uses | |
1079 | rules (a), (b), (c), (h), and (i). | |
1080 | ||
1081 | - Reached through Gateway. The switch and remote are on different subnets and | |
1082 | must go through a gateway. This uses rules (a), (b), (c), (h), and (i). | |
1083 | ||
1084 | - Between Switch and Remote. This switch is between another switch and the | |
1085 | remote, and we want to allow the other switch's traffic through. This uses | |
1086 | rules (d), (e), (h), and (i). It uses (b) and (c) indirectly in order to | |
1087 | know the MAC address for rules (d) and (e). Note that DHCP for the other | |
1088 | switch will not work unless an OpenFlow controller explicitly lets this | |
1089 | switch pass the traffic. | |
1090 | ||
1091 | - Between Switch and Gateway. This switch is between another switch and the | |
1092 | gateway, and we want to allow the other switch's traffic through. This uses | |
1093 | the same rules and logic as the "Between Switch and Remote" configuration | |
1094 | described earlier. | |
1095 | ||
1096 | - Remote on Local VM. The remote is a guest VM on the system running in-band | |
1097 | control. This uses rules (a), (b), (c), (h), and (i). | |
1098 | ||
1099 | - Remote on Local VM with Different Networks. The remote is a guest VM on the | |
1100 | system running in-band control, but the local port is not used to connect to | |
1101 | the remote. For example, an IP address is configured on eth0 of the switch. | |
1102 | The remote's VM is connected through eth1 of the switch, but an IP address | |
1103 | has not been configured for that port on the switch. As such, the switch | |
1104 | will use eth0 to connect to the remote, and eth1's rules about the local port | |
1105 | will not work. In the example, the switch attached to eth0 would use rules | |
1106 | (a), (b), (c), (h), and (i) on eth0. The switch attached to eth1 would use | |
1107 | rules (f), (g), (h), and (i). | |
1108 | ||
1109 | The following are explicitly *not* supported by in-band control: | |
1110 | ||
1111 | - Specify Remote by Name. Currently, the remote must be identified by IP | |
1112 | address. A naive approach would be to permit all DNS traffic. | |
1113 | Unfortunately, this would prevent the controller from defining any policy | |
1114 | over DNS. Since switches that are located behind us need to connect to the | |
1115 | remote, in-band cannot simply add a rule that allows DNS traffic from the | |
1116 | local port. The "correct" way to support this is to parse DNS requests to | |
1117 | allow all traffic related to a request for the remote's name through. Due to | |
1118 | the potential security problems and amount of processing, we decided to hold | |
1119 | off for the time being. | |
1120 | ||
1121 | - Differing Remotes for Switches. All switches must know the L3 addresses for | |
1122 | all the remotes that other switches may use, since rules need to be set up to | |
1123 | allow traffic related to those remotes through. See rules (f), (g), (h), and | |
1124 | (i). | |
1125 | ||
1126 | - Differing Routes for Switches. In order for the switch to allow other | |
1127 | switches to connect to a remote through a gateway, it allows the gateway's | |
1128 | traffic through with rules (d) and (e). If the routes to the remote differ | |
1129 | for the two switches, we will not know the MAC address of the alternate | |
1130 | gateway. | |
1131 | ||
1132 | Action Reproduction | |
1133 | ------------------- | |
1134 | ||
1135 | It seems likely that many controllers, at least at startup, use the OpenFlow | |
1136 | "flow statistics" request to obtain existing flows, then compare the flows' | |
1137 | actions against the actions that they expect to find. Before version 1.8.0, | |
1138 | Open vSwitch always returned exact, byte-for-byte copies of the actions that | |
1139 | had been added to the flow table. The current version of Open vSwitch does not | |
1140 | do this in some exceptional cases. This section lists the exceptions | |
1141 | that controller authors must keep in mind if they compare actual actions | |
1142 | against desired actions in a bytewise fashion: | |
1143 | ||
1144 | - Open vSwitch zeros padding bytes in action structures, regardless of their | |
1145 | values when the flows were added. | |
1146 | ||
1147 | - Open vSwitch "normalizes" the instructions in OpenFlow 1.1 (and later) in the | |
1148 | following way: | |
1149 | ||
1150 | * OVS sorts the instructions into the following order: Apply-Actions, | |
1151 | Clear-Actions, Write-Actions, Write-Metadata, Goto-Table. | |
1152 | ||
1153 | * OVS drops Apply-Actions instructions that have empty action lists. | |
1154 | ||
1155 | * OVS drops Write-Actions instructions that have empty action sets. | |
1156 | ||
1157 | Please report other discrepancies, if you notice any, so that we can fix or | |
1158 | document them. | |
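
A controller that compares flows bytewise could apply the same instruction normalization to its own expected flows before comparing. The sketch below models only the ordering and dropping rules listed above, using a hypothetical representation in which each instruction is a ``(name, payload)`` pair; it does not model the zeroing of padding bytes.

```python
INSTRUCTION_ORDER = [
    "Apply-Actions",
    "Clear-Actions",
    "Write-Actions",
    "Write-Metadata",
    "Goto-Table",
]


def normalize_instructions(instructions):
    """Sort instructions and drop empty Apply-Actions and Write-Actions.

    instructions: list of (name, payload) pairs. For Apply-Actions and
    Write-Actions the payload is a list of actions.
    """
    kept = [
        (name, payload)
        for name, payload in instructions
        if not (name in ("Apply-Actions", "Write-Actions") and not payload)
    ]
    return sorted(kept, key=lambda ins: INSTRUCTION_ORDER.index(ins[0]))
```

Comparing ``normalize_instructions(expected)`` against the instructions reported in a flow statistics reply avoids false mismatches caused purely by ordering or by empty action lists.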
1159 | ||
1160 | Suggestions | |
1161 | ----------- | |
1162 | ||
1163 | Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org. |