Design Decisions In Open vSwitch
================================

This document describes design decisions that went into implementing
Open vSwitch. While we believe these to be reasonable decisions, it is
impossible to predict how Open vSwitch will be used in all environments.
Understanding assumptions made by Open vSwitch is critical to a
successful deployment. The end of this document contains contact
information that can be used to let us know how we can make Open vSwitch
more generally useful.

Asynchronous Messages
=====================

Over time, Open vSwitch has added many knobs that control whether a
given controller receives OpenFlow asynchronous messages. This
section describes how all of these features interact.

First, a service controller never receives any asynchronous messages
unless it changes its miss_send_len from the service controller
default of zero in one of the following ways:

  - Sending an OFPT_SET_CONFIG message with nonzero miss_send_len.

  - Sending any NXT_SET_ASYNC_CONFIG message: as a side effect, this
    message changes the miss_send_len to
    OFP_DEFAULT_MISS_SEND_LEN (128) for service controllers.

Second, OFPT_FLOW_REMOVED and NXT_FLOW_REMOVED messages are generated
only if the flow that was removed had the OFPFF_SEND_FLOW_REM flag
set.

Third, OFPT_PACKET_IN and NXT_PACKET_IN messages are sent only to
OpenFlow controller connections that have the correct connection ID
(see "struct nx_controller_id" and "struct nx_action_controller"):

  - For packet-in messages generated by an NXAST_CONTROLLER action,
    the controller ID specified in the action.

  - For other packet-in messages, controller ID zero. (This is the
    default ID when an OpenFlow controller does not configure one.)

Finally, Open vSwitch consults a per-connection table indexed by the
message type, reason code, and current role. The following table
shows how this table is initialized by default when an OpenFlow
connection is made. An entry labeled "yes" means that the message is
sent; an entry labeled "---" means that the message is suppressed.

```
                                          master/
message and reason code                    other   slave
----------------------------------------  -------  -----
OFPT_PACKET_IN / NXT_PACKET_IN
  OFPR_NO_MATCH                             yes     ---
  OFPR_ACTION                               yes     ---
  OFPR_INVALID_TTL                          ---     ---
  OFPR_ACTION_SET (OF1.4+)                  yes     ---
  OFPR_GROUP (OF1.4+)                       yes     ---

OFPT_FLOW_REMOVED / NXT_FLOW_REMOVED
  OFPRR_IDLE_TIMEOUT                        yes     ---
  OFPRR_HARD_TIMEOUT                        yes     ---
  OFPRR_DELETE                              yes     ---
  OFPRR_GROUP_DELETE (OF1.4+)               yes     ---
  OFPRR_METER_DELETE (OF1.4+)               yes     ---
  OFPRR_EVICTION (OF1.4+)                   yes     ---

OFPT_PORT_STATUS
  OFPPR_ADD                                 yes     yes
  OFPPR_DELETE                              yes     yes
  OFPPR_MODIFY                              yes     yes

OFPT_ROLE_REQUEST / OFPT_ROLE_REPLY (OF1.4+)
  OFPCRR_MASTER_REQUEST                     ---     ---
  OFPCRR_CONFIG                             ---     ---
  OFPCRR_EXPERIMENTER                       ---     ---

OFPT_TABLE_STATUS (OF1.4+)
  OFPTR_VACANCY_DOWN                        ---     ---
  OFPTR_VACANCY_UP                          ---     ---

OFPT_REQUESTFORWARD (OF1.4+)
  OFPRFR_GROUP_MOD                          ---     ---
  OFPRFR_METER_MOD                          ---     ---
```

The NXT_SET_ASYNC_CONFIG message directly sets all of the values in
this table for the current connection. The
OFPC_INVALID_TTL_TO_CONTROLLER bit in the OFPT_SET_CONFIG message
controls the setting for OFPR_INVALID_TTL for the "master" role.


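The four checks in this section can be read as one short decision procedure. The Python sketch below is illustrative only: the `Connection` class, its field names, and `should_send()` are hypothetical and are not Open vSwitch identifiers. It reproduces only a few rows of the default table and treats "master/other" as two role names.

```python
from dataclasses import dataclass, field

# A few rows of the default table: (message, reason) -> roles that
# receive the message.  Rows not listed here are treated as suppressed.
DEFAULT_ASYNC = {
    ("PACKET_IN", "OFPR_NO_MATCH"): {"master", "other"},
    ("PACKET_IN", "OFPR_ACTION"): {"master", "other"},
    ("FLOW_REMOVED", "OFPRR_IDLE_TIMEOUT"): {"master", "other"},
    ("FLOW_REMOVED", "OFPRR_DELETE"): {"master", "other"},
    ("PORT_STATUS", "OFPPR_ADD"): {"master", "other", "slave"},
}


@dataclass
class Connection:
    """Hypothetical model of one OpenFlow connection (not an OVS type)."""
    role: str = "other"          # "master", "other", or "slave"
    is_service: bool = False     # True for a service controller
    miss_send_len: int = 128     # a service controller starts at zero
    controller_id: int = 0       # default ID when none is configured
    async_table: dict = field(default_factory=lambda: dict(DEFAULT_ASYNC))


def should_send(conn, message, reason, *, send_flow_rem=False, target_id=0):
    """Return True if 'message' with 'reason' would be sent to 'conn'."""
    # First: a service controller with miss_send_len of zero gets nothing.
    if conn.is_service and conn.miss_send_len == 0:
        return False
    # Second: flow_removed needs OFPFF_SEND_FLOW_REM on the removed flow.
    if message == "FLOW_REMOVED" and not send_flow_rem:
        return False
    # Third: packet-ins go only to connections with the matching ID.
    if message == "PACKET_IN" and conn.controller_id != target_id:
        return False
    # Finally: consult the per-connection table for the current role.
    return conn.role in conn.async_table.get((message, reason), set())
```

In this model, an NXT_SET_ASYNC_CONFIG message would correspond to replacing `async_table` for one connection.
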
OFPAT_ENQUEUE
=============

The OpenFlow 1.0 specification requires the output port of the OFPAT_ENQUEUE
action to "refer to a valid physical port (i.e. < OFPP_MAX) or OFPP_IN_PORT".
Although OFPP_LOCAL is not less than OFPP_MAX, it is an 'internal' port which
can have QoS applied to it in Linux. Since we allow OFPAT_ENQUEUE to apply
to 'internal' ports whose port numbers are less than OFPP_MAX, we interpret
OFPP_LOCAL as a physical port and support OFPAT_ENQUEUE on it as well.


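As a minimal sketch of the difference, the two checks below compare the rule as quoted from the specification with the relaxed rule described above. The function names are hypothetical; the port constants are the OpenFlow 1.0 reserved port numbers.

```python
# Reserved port numbers from the OpenFlow 1.0 specification.
OFPP_MAX = 0xFF00
OFPP_IN_PORT = 0xFFF8
OFPP_LOCAL = 0xFFFE


def enqueue_port_ok_spec(port):
    """Port check as the quoted OpenFlow 1.0 text reads."""
    return port < OFPP_MAX or port == OFPP_IN_PORT


def enqueue_port_ok_ovs(port):
    """Same check, additionally accepting OFPP_LOCAL."""
    return enqueue_port_ok_spec(port) or port == OFPP_LOCAL
```

Only OFPP_LOCAL changes status between the two functions; other reserved ports are rejected by both.
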
OFPT_FLOW_MOD
=============

The OpenFlow specification for the behavior of OFPT_FLOW_MOD is
confusing. The following tables summarize the Open vSwitch
implementation of its behavior in the following categories:

  - "match on priority": Whether the flow_mod acts only on flows
    whose priority matches that included in the flow_mod message.

  - "match on out_port": Whether the flow_mod acts only on flows
    that output to the out_port included in the flow_mod message (if
    out_port is not OFPP_NONE). OpenFlow 1.1 and later have a
    similar feature (not listed separately here) for out_group.

  - "match on flow_cookie": Whether the flow_mod acts only on flows
    whose flow_cookie matches an optional controller-specified value
    and mask.

  - "updates flow_cookie": Whether the flow_mod changes the
    flow_cookie of the flow or flows that it matches to the
    flow_cookie included in the flow_mod message.

  - "updates OFPFF_SEND_FLOW_REM": Whether the flow_mod changes the
    OFPFF_SEND_FLOW_REM flag of the flow or flows that it matches to
    the setting included in the flags of the flow_mod message.

  - "honors OFPFF_CHECK_OVERLAP": Whether the OFPFF_CHECK_OVERLAP
    flag in the flow_mod is significant.

  - "updates idle_timeout" and "updates hard_timeout": Whether the
    idle_timeout and hard_timeout in the flow_mod, respectively,
    have an effect on the flow or flows matched by the flow_mod.

  - "resets idle timer": Whether the flow_mod resets the per-flow
    timer that measures how long a flow has been idle.

  - "resets hard timer": Whether the flow_mod resets the per-flow
    timer that measures how long it has been since a flow was
    modified.

  - "zeros counters": Whether the flow_mod resets per-flow packet
    and byte counters to zero.

  - "may add a new flow": Whether the flow_mod may add a new flow to
    the flow table. (Obviously this is always true for "add"
    commands but in some OpenFlow versions "modify" and
    "modify-strict" can also add new flows.)

  - "sends flow_removed message": Whether the flow_mod generates a
    flow_removed message for the flow or flows that it affects.

An entry labeled "yes" means that the flow mod type does have the
indicated behavior, "---" means that it does not, an empty cell means
that the property is not applicable, and other values are explained
below the table.

OpenFlow 1.0
------------

```
                                           MODIFY          DELETE
                             ADD   MODIFY  STRICT  DELETE  STRICT
                             ===   ======  ======  ======  ======
match on priority            yes    ---     yes     ---     yes
match on out_port            ---    ---     ---     yes     yes
match on flow_cookie         ---    ---     ---     ---     ---
match on table_id            ---    ---     ---     ---     ---
controller chooses table_id  ---    ---     ---
updates flow_cookie          yes    yes     yes
updates OFPFF_SEND_FLOW_REM  yes     +       +
honors OFPFF_CHECK_OVERLAP   yes     +       +
updates idle_timeout         yes     +       +
updates hard_timeout         yes     +       +
resets idle timer            yes     +       +
resets hard timer            yes    yes     yes
zeros counters               yes     +       +
may add a new flow           yes    yes     yes
sends flow_removed message   ---    ---     ---      %       %

(+) "modify" and "modify-strict" only take these actions when they
    create a new flow, not when they update an existing flow.

(%) "delete" and "delete_strict" generate a flow_removed message if
    the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set.
    (Each controller can separately control whether it wants to
    receive the generated messages.)
```

OpenFlow 1.1
------------

OpenFlow 1.1 makes these changes:

  - The controller now must specify the table_id of the flow match
    searched and into which a flow may be inserted. Behavior for a
    table_id of 255 is undefined.

  - A flow_mod, except an "add", can now match on the flow_cookie.

  - When a flow_mod matches on the flow_cookie, "modify" and
    "modify-strict" never insert a new flow.

```
                                           MODIFY          DELETE
                             ADD   MODIFY  STRICT  DELETE  STRICT
                             ===   ======  ======  ======  ======
match on priority            yes    ---     yes     ---     yes
match on out_port            ---    ---     ---     yes     yes
match on flow_cookie         ---    yes     yes     yes     yes
match on table_id            yes    yes     yes     yes     yes
controller chooses table_id  yes    yes     yes
updates flow_cookie          yes    ---     ---
updates OFPFF_SEND_FLOW_REM  yes     +       +
honors OFPFF_CHECK_OVERLAP   yes     +       +
updates idle_timeout         yes     +       +
updates hard_timeout         yes     +       +
resets idle timer            yes     +       +
resets hard timer            yes    yes     yes
zeros counters               yes     +       +
may add a new flow           yes     #       #
sends flow_removed message   ---    ---     ---      %       %

(+) "modify" and "modify-strict" only take these actions when they
    create a new flow, not when they update an existing flow.

(%) "delete" and "delete_strict" generate a flow_removed message if
    the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set.
    (Each controller can separately control whether it wants to
    receive the generated messages.)

(#) "modify" and "modify-strict" only add a new flow if the flow_mod
    does not match on any bits of the flow cookie.
```

OpenFlow 1.2
------------

OpenFlow 1.2 makes these changes:

  - Only "add" commands ever add flows; "modify" and "modify-strict"
    never do.

  - A new flag OFPFF_RESET_COUNTS now controls whether "modify" and
    "modify-strict" reset counters, whereas previously they never
    reset counters (except when they inserted a new flow).

```
                                           MODIFY          DELETE
                             ADD   MODIFY  STRICT  DELETE  STRICT
                             ===   ======  ======  ======  ======
match on priority            yes    ---     yes     ---     yes
match on out_port            ---    ---     ---     yes     yes
match on flow_cookie         ---    yes     yes     yes     yes
match on table_id            yes    yes     yes     yes     yes
controller chooses table_id  yes    yes     yes
updates flow_cookie          yes    ---     ---
updates OFPFF_SEND_FLOW_REM  yes    ---     ---
honors OFPFF_CHECK_OVERLAP   yes    ---     ---
updates idle_timeout         yes    ---     ---
updates hard_timeout         yes    ---     ---
resets idle timer            yes    ---     ---
resets hard timer            yes    yes     yes
zeros counters               yes     &       &
may add a new flow           yes    ---     ---
sends flow_removed message   ---    ---     ---      %       %

(%) "delete" and "delete_strict" generate a flow_removed message if
    the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set.
    (Each controller can separately control whether it wants to
    receive the generated messages.)

(&) "modify" and "modify-strict" reset counters if the
    OFPFF_RESET_COUNTS flag is specified.
```

OpenFlow 1.3
------------

OpenFlow 1.3 makes these changes:

  - Behavior for a table_id of 255 is now defined, for "delete" and
    "delete-strict" commands, as meaning to delete from all tables.
    A table_id of 255 is now explicitly invalid for other commands.

  - New flags OFPFF_NO_PKT_COUNTS and OFPFF_NO_BYT_COUNTS for "add"
    operations.

The table for 1.3 is the same as the one shown above for 1.2.


OpenFlow 1.4
------------

OpenFlow 1.4 makes these changes:

  - Adds the "importance" field to flow_mods, but it does not
    explicitly specify which kinds of flow_mods set the importance.
    For consistency, Open vSwitch uses the same rule for importance
    as for idle_timeout and hard_timeout, that is, only an "ADD"
    flow_mod sets the importance. (This issue has been filed with
    the ONF as EXT-496.)

  - Adds an eviction mechanism that automatically deletes entries of
    lower importance to make space for newer entries.


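The "may add a new flow" row is the one that changes most between versions, so it may help to see it as a single function. This is a sketch of the standard OpenFlow behavior tabulated above, not of Open vSwitch code: `may_add_flow()` and its argument conventions are hypothetical, and the Nicira extensions to OpenFlow 1.0 are not modeled.

```python
def may_add_flow(version, command, cookie_mask=0):
    """Return True if the flow_mod may insert a new flow.

    'version' is a (major, minor) tuple such as (1, 1).  'command' is
    one of "add", "modify", "modify_strict", "delete", "delete_strict".
    'cookie_mask' is the flow cookie mask (0 means "do not match on
    the cookie"); it only matters for OpenFlow 1.1.
    """
    if command == "add":
        return True
    if command in ("delete", "delete_strict"):
        return False
    # Remaining commands are "modify" and "modify_strict".
    if version >= (1, 2):
        return False                # OF1.2+: only "add" adds flows
    if version == (1, 1):
        return cookie_mask == 0     # OF1.1: not when matching on cookie
    return True                     # OF1.0: always adds on a miss
```

For example, `may_add_flow((1, 1), "modify", cookie_mask=0xff)` is false, which corresponds to the "#" footnote in the OpenFlow 1.1 table.
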
OpenFlow 1.4 Bundles
====================

Open vSwitch makes all flow table modifications atomically, i.e., any
datapath packet only sees flow table configurations either before or
after any change made by any flow_mod. For example, if a controller
removes all flows with a single OpenFlow "flow_mod", no packet sees an
intermediate version of the OpenFlow pipeline where only some of the
flows have been deleted.

Note that Open vSwitch caches datapath flows, and that the cached
flows are NOT flushed immediately when a flow table changes. Instead,
the datapath flows are revalidated against the new flow table as soon
as possible, and usually within one second of the modification. This
design amortizes the cost of datapath cache flushing across multiple
flow table changes, and significantly improves performance during
simultaneous heavy flow table churn and high traffic load. This means
that different cached datapath flows may have been computed based on
different flow table configurations, but each of the datapath flows is
guaranteed to have been computed over a coherent view of the flow
tables, as described above.

With OpenFlow 1.4 bundles this atomicity can be extended across an
arbitrary set of flow_mods. Bundles are supported for flow_mod and
port_mod messages only. For flow_mods, both 'atomic' and 'ordered'
bundle flags are trivially supported, as all bundled messages are
executed in the order they were added and all flow table modifications
are now atomic to the datapath. Port mods may not appear in atomic
bundles, as port status modifications are not atomic.

To support bundles, ovs-ofctl has a '--bundle' option that makes the
flow mod commands ('add-flow', 'add-flows', 'mod-flows', 'del-flows',
and 'replace-flows') use an OpenFlow 1.4 bundle to apply the
modifications as a single atomic transaction. If any of the flow mods
in a transaction fail, none of them are executed. All flow mods in a
bundle appear to datapath lookups simultaneously.

Furthermore, ovs-ofctl 'add-flow' and 'add-flows' commands now accept
arbitrary flow mods as an input by allowing the flow specification to
start with an explicit 'add', 'modify', 'modify_strict', 'delete', or
'delete_strict' keyword. A missing keyword is treated as 'add', so
this is fully backwards compatible. With the new '--bundle' option
all the flow mods are executed as a single atomic transaction using an
OpenFlow 1.4 bundle. Without the '--bundle' option the flow mods are
executed in order up to the first failing flow_mod, and in case of an
error the earlier successful flow_mods are not rolled back.


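The difference between the two ovs-ofctl modes can be illustrated with a toy model. Everything in this sketch is hypothetical: the flow table is a plain dict, a flow mod is a `(command, match, actions)` tuple, and an unknown command stands in for any flow mod that the switch rejects. It does not reflect how Open vSwitch implements bundles internally.

```python
def apply_mod(table, mod):
    """Apply one (command, match, actions) tuple to 'table' in place."""
    command, match, actions = mod
    if command == "add":
        table[match] = actions
    elif command == "delete":
        table.pop(match, None)
    else:
        # Stand-in for any flow mod that the switch would reject.
        raise ValueError(f"unsupported flow mod: {command!r}")


def apply_without_bundle(table, mods):
    """Apply mods in order, stopping at the first failure (no rollback)."""
    result = dict(table)
    for mod in mods:
        try:
            apply_mod(result, mod)
        except ValueError:
            break                   # earlier mods stay applied
    return result


def apply_with_bundle(table, mods):
    """Apply all mods or none: stage on a copy, commit only on success."""
    staged = dict(table)
    try:
        for mod in mods:
            apply_mod(staged, mod)
    except ValueError:
        return dict(table)          # discard the staged changes
    return staged
```

With a list whose second entry fails, `apply_without_bundle` returns a table that contains the first flow, while `apply_with_bundle` returns the table unchanged.
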
OFPT_PACKET_IN
==============

The OpenFlow 1.1 specification for OFPT_PACKET_IN is confusing. The
definition in OF1.1 openflow.h is[*]:

```
/* Packet received on port (datapath -> controller). */
struct ofp_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;     /* ID assigned by datapath. */
    uint32_t in_port;       /* Port on which frame was received. */
    uint32_t in_phy_port;   /* Physical Port on which frame was received. */
    uint16_t total_len;     /* Full length of frame. */
    uint8_t reason;         /* Reason packet is being sent (one of OFPR_*) */
    uint8_t table_id;       /* ID of the table that was looked up */
    uint8_t data[0];        /* Ethernet frame, halfway through 32-bit word,
                               so the IP header is 32-bit aligned. The
                               amount of data is inferred from the length
                               field in the header. Because of padding,
                               offsetof(struct ofp_packet_in, data) ==
                               sizeof(struct ofp_packet_in) - 2. */
};
OFP_ASSERT(sizeof(struct ofp_packet_in) == 24);
```

The confusing part is the comment on the data[] member. This comment
is a leftover from OF1.0 openflow.h, in which the comment was correct:
sizeof(struct ofp_packet_in) is 20 in OF1.0 and offsetof(struct
ofp_packet_in, data) is 18. When OF1.1 was written, the structure
members were changed but the comment was carelessly not updated, and
the comment became wrong: sizeof(struct ofp_packet_in) and
offsetof(struct ofp_packet_in, data) are both 24 in OF1.1.

That leaves the question of how to implement ofp_packet_in in OF1.1.
The OpenFlow reference implementation for OF1.1 does not include any
padding, that is, the first byte of the encapsulated frame immediately
follows the 'table_id' member without a gap. Open vSwitch therefore
implements it the same way for compatibility.

For an earlier discussion, please see the thread archived at:
https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html

[*] The quoted definition is directly from OF1.1. Definitions used
    inside OVS omit the 8-byte ofp_header members, so the sizes in
    this discussion are 8 bytes larger than those declared in OVS
    header files.


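The unpadded layout can be checked with Python's `struct` module. The sketch below is only an illustration of the wire layout discussed above; `OFP11_PACKET_IN` and `parse_of11_packet_in()` are names made up for this example, and the 8-byte ofp_header is treated as opaque bytes.

```python
import struct

# Fixed part of the OF1.1 ofp_packet_in, in network byte order:
# 8-byte ofp_header, buffer_id, in_port, in_phy_port, total_len,
# reason, table_id.  The "!" prefix disables alignment padding.
OFP11_PACKET_IN = struct.Struct("!8sIIIHBB")


def parse_of11_packet_in(msg):
    """Split an OF1.1 packet-in into its fixed fields and the frame."""
    (header, buffer_id, in_port, in_phy_port,
     total_len, reason, table_id) = OFP11_PACKET_IN.unpack_from(msg)
    fields = {
        "header": header,
        "buffer_id": buffer_id,
        "in_port": in_port,
        "in_phy_port": in_phy_port,
        "total_len": total_len,
        "reason": reason,
        "table_id": table_id,
    }
    # The frame starts right after table_id, with no padding in between.
    return fields, msg[OFP11_PACKET_IN.size:]
```

`OFP11_PACKET_IN.size` is 24, matching both the sizeof and the offsetof values given above for OF1.1.
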
VLAN Matching
=============

The 802.1Q VLAN header causes more trouble than any other 4 bytes in
networking. More specifically, three versions of OpenFlow and Open
vSwitch have among them four different ways to match the contents and
presence of the VLAN header. The following table describes how each
version works.

    Match  NXM        OF1.0        OF1.1        OF1.2
    -----  ---------  -----------  -----------  ------------
    [1]    0000/0000  ????/1,??/?  ????/1,??/?  0000/0000,--
    [2]    0000/ffff  ffff/0,??/?  ffff/0,??/?  0000/ffff,--
    [3]    1xxx/1fff  0xxx/0,??/1  0xxx/0,??/1  1xxx/ffff,--
    [4]    z000/f000  ????/1,0y/0  fffe/0,0y/0  1000/1000,0y
    [5]    zxxx/ffff  0xxx/0,0y/0  0xxx/0,0y/0  1xxx/ffff,0y
    [6]    0000/0fff  <none>       <none>       <none>
    [7]    0000/f000  <none>       <none>       <none>
    [8]    0000/efff  <none>       <none>       <none>
    [9]    1001/1001  <none>       <none>       1001/1001,--
    [10]   3000/3000  <none>       <none>       <none>

Each column is interpreted as follows.

  - Match: See the list below.

  - NXM: xxxx/yyyy means NXM_OF_VLAN_TCI_W with value xxxx and mask
    yyyy. A mask of 0000 is equivalent to omitting
    NXM_OF_VLAN_TCI(_W), a mask of ffff is equivalent to
    NXM_OF_VLAN_TCI.

  - OF1.0 and OF1.1: wwww/x,yy/z means dl_vlan wwww, OFPFW_DL_VLAN x,
    dl_vlan_pcp yy, and OFPFW_DL_VLAN_PCP z. If OFPFW_DL_VLAN or
    OFPFW_DL_VLAN_PCP is 1, the corresponding field value is
    wildcarded, otherwise it is matched. ? means that the given bits
    are ignored (their conventional values are 0000/x,00/0 in OF1.0,
    0000/x,00/1 in OF1.1; x is never ignored). <none> means that the
    given match is not supported.

  - OF1.2: xxxx/yyyy,zz means OXM_OF_VLAN_VID_W with value xxxx and
    mask yyyy, and OXM_OF_VLAN_PCP (which is not maskable) with
    value zz. A mask of 0000 is equivalent to omitting
    OXM_OF_VLAN_VID(_W), a mask of ffff is equivalent to
    OXM_OF_VLAN_VID. -- means that OXM_OF_VLAN_PCP is omitted.
    <none> means that the given match is not supported.

The matches are:

 [1] Matches any packet, that is, one without an 802.1Q header or with
     an 802.1Q header with any TCI value.

 [2] Matches only packets without an 802.1Q header.

     NXM: Any match with (vlan_tci == 0) and (vlan_tci_mask & 0x1000)
     != 0 is equivalent to the one listed in the table.

     OF1.0: The spec doesn't define behavior if dl_vlan is set to
     0xffff and OFPFW_DL_VLAN_PCP is not set.

     OF1.1: The spec says explicitly to ignore dl_vlan_pcp when
     dl_vlan is set to 0xffff.

     OF1.2: The spec doesn't say what should happen if (vlan_vid == 0)
     and (vlan_vid_mask & 0x1000) != 0 but (vlan_vid_mask != 0x1000),
     but it would be straightforward to also interpret as [2].

 [3] Matches only packets that have an 802.1Q header with VID xxx (and
     any PCP).

 [4] Matches only packets that have an 802.1Q header with PCP y (and
     any VID).

     NXM: z is ((y << 1) | 1).

     OF1.0: The spec isn't very clear, but OVS implements it this way.

     OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff)
     == 0x1000 would also work, but the spec doesn't define their
     behavior.

 [5] Matches only packets that have an 802.1Q header with VID xxx and
     PCP y.

     NXM: z is ((y << 1) | 1).

     OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff)
     == 0x1fff would also work.

 [6] Matches packets with no 802.1Q header or with an 802.1Q header
     with a VID of 0. Only possible with NXM.

 [7] Matches packets with no 802.1Q header or with an 802.1Q header
     with a PCP of 0. Only possible with NXM.

 [8] Matches packets with no 802.1Q header or with an 802.1Q header
     with both VID and PCP of 0. Only possible with NXM.

 [9] Matches only packets that have an 802.1Q header with an
     odd-numbered VID (and any PCP). Only possible with NXM and
     OF1.2. (This is just an example; one can match on any desired
     VID bit pattern.)

[10] Matches only packets that have an 802.1Q header with an
     odd-numbered PCP (and any VID). Only possible with NXM. (This
     is just an example; one can match on any desired PCP bit
     pattern.)

Additional notes:

  - OF1.2: The top three bits of OXM_OF_VLAN_VID are fixed to zero,
    so bits 13, 14, and 15 in the masks listed in the table may be
    set to arbitrary values, as long as the corresponding value bits
    are also zero. The suggested ffff mask for [2], [3], and [5]
    allows a shorter OXM representation (the mask is omitted) than
    the minimal 1fff mask.


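The NXM column of the table can be reproduced with a plain masked comparison. The sketch below assumes, as the table entries imply, that a packet without an 802.1Q header has a vlan_tci of 0 and that bit 0x1000 is set whenever a header is present. `make_tci()` and `nxm_vlan_match()` are hypothetical helper names.

```python
VLAN_CFI = 0x1000   # set in vlan_tci whenever an 802.1Q header is present


def make_tci(vid=None, pcp=0):
    """Return the vlan_tci for a packet; vid=None means no 802.1Q header."""
    if vid is None:
        return 0
    return (pcp << 13) | VLAN_CFI | (vid & 0x0FFF)


def nxm_vlan_match(tci, value, mask):
    """Masked comparison in the style of NXM_OF_VLAN_TCI_W."""
    return (tci & mask) == (value & mask)
```

For a packet with VID 5 and PCP 3, `make_tci(5, pcp=3)` is 0x7005. It matches 1005/1fff (case [3]) and 7000/f000 (case [4], where z is ((3 << 1) | 1) = 7), and it does not match 0000/ffff (case [2]).
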
Flow Cookies
============

OpenFlow 1.0 and later versions have the concept of a "flow cookie",
which is a 64-bit integer value attached to each flow. The treatment
of the flow cookie has varied greatly across OpenFlow versions,
however.

In OpenFlow 1.0:

  - OFPFC_ADD set the cookie in the flow that it added.

  - OFPFC_MODIFY and OFPFC_MODIFY_STRICT updated the cookie for
    the flow or flows that it modified.

  - OFPST_FLOW messages included the flow cookie.

  - OFPT_FLOW_REMOVED messages reported the cookie of the flow
    that was removed.

OpenFlow 1.1 made the following changes:

  - Flow mod operations OFPFC_MODIFY, OFPFC_MODIFY_STRICT,
    OFPFC_DELETE, and OFPFC_DELETE_STRICT, plus flow stats
    requests and aggregate stats requests, gained the ability to
    match on flow cookies with an arbitrary mask.

  - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to add a
    new flow, in the case of no match, only if the flow table
    modification operation did not match on the cookie field.
    (In OpenFlow 1.0, modify operations always added a new flow
    when there was no match.)

  - OFPFC_MODIFY and OFPFC_MODIFY_STRICT no longer updated flow
    cookies.

OpenFlow 1.2 made the following changes:

  - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to never
    add a new flow, regardless of whether the flow cookie was
    used for matching.

Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0
behavior with the following extensions:

  - An NXM extension field NXM_NX_COOKIE(_W) allows the NXM
    versions of OFPFC_MODIFY, OFPFC_MODIFY_STRICT, OFPFC_DELETE,
    and OFPFC_DELETE_STRICT flow_mods, plus flow stats requests
    and aggregate stats requests, to match on flow cookies with
    arbitrary masks. This is much like the equivalent OpenFlow
    1.1 feature.

  - Like OpenFlow 1.1, OFPFC_MODIFY and OFPFC_MODIFY_STRICT add a
    new flow if there is no match and the mask is zero (or not
    given).

  - The "cookie" field in OFPT_FLOW_MOD and NXT_FLOW_MOD messages
    is used as the cookie value for OFPFC_ADD commands, as
    described in OpenFlow 1.0. For OFPFC_MODIFY and
    OFPFC_MODIFY_STRICT commands, the "cookie" field is used as a
    new cookie for flows that match unless it is UINT64_MAX, in
    which case the flow's cookie is not updated.

  - NXT_PACKET_IN (the Nicira extended version of
    OFPT_PACKET_IN) reports the cookie of the rule that
    generated the packet, or all-1-bits if no rule generated the
    packet. (Older versions of OVS used all-0-bits instead of
    all-1-bits.)

The following table shows the handling of different protocols when
receiving OFPFC_MODIFY and OFPFC_MODIFY_STRICT messages. A mask of 0
indicates either an explicit mask of zero or an implicit one by not
specifying the NXM_NX_COOKIE(_W) field.

```
              Match   Update  Add on miss  Add on miss
              cookie  cookie    mask!=0      mask==0
              ======  ======  ===========  ===========
OpenFlow 1.0   no      yes      <always add on miss>
OpenFlow 1.1   yes     no         no           yes
OpenFlow 1.2   yes     no         no           no
NXM            yes     yes*       no           yes

* Updates the flow's cookie unless the "cookie" field is UINT64_MAX.
```

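Two of the rules above are easy to express as bit operations. The sketch assumes the usual masked-match reading, in which only the bits set in the mask are compared, so a mask of 0 matches every flow. Both function names are hypothetical.

```python
UINT64_MAX = 0xFFFFFFFFFFFFFFFF


def cookie_matches(flow_cookie, cookie, mask):
    """Masked cookie match; a mask of 0 matches every flow."""
    return (flow_cookie & mask) == (cookie & mask)


def nxm_modify_cookie(flow_cookie, new_cookie):
    """Cookie of a matched flow after an NXM modify (the "*" note)."""
    return flow_cookie if new_cookie == UINT64_MAX else new_cookie
```

For example, `cookie_matches(0x1234, 0x0034, 0x00ff)` is true because only the low byte is compared, and `nxm_modify_cookie(0x1234, UINT64_MAX)` leaves the cookie at 0x1234.
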
Multiple Table Support
======================

OpenFlow 1.0 has only rudimentary support for multiple flow tables.
Notably, OpenFlow 1.0 does not allow the controller to specify the
flow table to which a flow is to be added. Open vSwitch adds an
extension for this purpose, which is enabled on a per-OpenFlow
connection basis using the NXT_FLOW_MOD_TABLE_ID message. When the
extension is enabled, the upper 8 bits of the 'command' member in an
OFPT_FLOW_MOD or NXT_FLOW_MOD message designate the table to which a
flow is to be added.

The Open vSwitch software switch implementation offers 255 flow
tables. On packet ingress, only the first flow table (table 0) is
searched, and the contents of the remaining tables are not considered
in any way. Tables other than table 0 only come into play when an
NXAST_RESUBMIT_TABLE action specifies another table to search.

Tables 128 and above are reserved for use by the switch itself.
Controllers should use only tables 0 through 127.


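The table ID encoding amounts to splitting a 16-bit value into two bytes. The helpers below are a hypothetical sketch of that encoding; they assume the 'command' member is 16 bits wide and that the low 8 bits continue to carry the flow_mod command itself.

```python
def split_command(command):
    """Split the 16-bit 'command' member into (table_id, command)."""
    return (command >> 8) & 0xFF, command & 0xFF


def join_command(table_id, command):
    """Combine a table ID and a flow_mod command into one 16-bit value."""
    return ((table_id & 0xFF) << 8) | (command & 0xFF)
```

With the extension enabled, `join_command(5, 0)` gives 0x0500, which designates table 5 for command 0.
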
82c22d34 BP |
633 | OFPTC_* Table Configuration |
634 | =========================== | |
635 | ||
636 | This section covers the history of the OFPTC_* table configuration | |
637 | bits across OpenFlow versions. | |
638 | ||
639 | OpenFlow 1.0 flow tables had fixed configurations. | |
640 | ||
641 | OpenFlow 1.1 enabled controllers to configure behavior upon flow table | |
642 | miss and added the OFPTC_MISS_* constants for that purpose. OFPTC_* | |
643 | did not control anything else but it was nevertheless conceptualized | |
644 | as a set of bit-fields instead of an enum. OF1.1 added the | |
645 | OFPT_TABLE_MOD message to set OFPTC_MISS_* for a flow table and added | |
646 | the 'config' field to the OFPST_TABLE reply to report the current | |
647 | setting. | |
648 | ||
649 | OpenFlow 1.2 did not change anything in this regard. | |
650 | ||
651 | OpenFlow 1.3 switched to another means to changing flow table miss | |
652 | behavior and deprecated OFPTC_MISS_* without adding any more OFPTC_* | |
653 | constants. This meant that OFPT_TABLE_MOD now had no purpose at all, | |
654 | but OF1.3 kept it around "for backward compatibility with older and | |
655 | newer versions of the specification." At the same time, OF1.3 | |
656 | introduced a new message OFPMP_TABLE_FEATURES that included a field | |
657 | 'config' documented as reporting the OFPTC_* values set with | |
658 | OFPT_TABLE_MOD; of course this served no real purpose because no | |
659 | OFPTC_* values are defined. OF1.3 did remove the OFPTC_* field from | |
660 | OFPMP_TABLE (previously named OFPST_TABLE). | |
661 | ||
662 | OpenFlow 1.4 defined two new OFPTC_* constants, OFPTC_EVICTION and | |
663 | OFPTC_VACANCY_EVENTS, using bits that did not overlap with | |
664 | OFPTC_MISS_* even though those bits had not been defined since OF1.2. | |
665 | OFPT_TABLE_MOD still controlled these settings. The field for OFPTC_* | |
666 | values in OFPMP_TABLE_FEATURES was renamed from 'config' to | |
667 | 'capabilities' and documented as reporting the flags that are | |
668 | supported in an OFPT_TABLE_MOD message. The OFPMP_TABLE_DESC message | |
669 | newly added in OF1.4 reported the OFPTC_* setting. | |
670 | ||
671 | OpenFlow 1.5 did not change anything in this regard. | |
672 | ||
673 | The following table summarizes. The columns say: | |
674 | ||
675 | - OpenFlow version(s). | |
676 | ||
677 | - The OFPTC_* flags defined in those versions. | |
678 | ||
679 | - Whether OFPT_TABLE_MOD can modify OFPTC_* flags. | |
680 | ||
681 | - Whether OFPST_TABLE/OFPMP_TABLE reports the OFPTC_* flags. | |
682 | ||
683 | - What OFPMP_TABLE_FEATURES reports (if it exists): either the | |
684 | current configuration or the switch's capabilities. | |
685 | ||
686 | - Whether OFPMP_TABLE_DESC reports the current configuration. | |
687 | ||
688 | OpenFlow OFPTC_* flags TABLE_MOD stats? TABLE_FEATURES TABLE_DESC | |
689 | --------- ----------------------- --------- ------ -------------- ---------- | |
690 | OF1.0 none no[*][+] no[*] nothing[*][+] no[*][+] | |
691 | OF1.1/1.2 MISS_* yes yes nothing[+] no[+] | |
692 | OF1.3 none yes[*] no[*] config[*] no[*][+] | |
693 | OF1.4/1.5 EVICTION/VACANCY_EVENTS yes no capabilities yes | |
694 | ||
695 | [*] Nothing to report/change anyway. | |
696 | ||
697 | [+] No such message. | |
698 | ||
699 | ||
d31f1109 JP |
700 | IPv6 |
701 | ==== | |
702 | ||
703 | Open vSwitch supports stateless handling of IPv6 packets. Flows can be | |
704 | written to support matching TCP, UDP, and ICMPv6 headers within an IPv6 | |
685a51a5 JP |
705 | packet. Deeper matching of some Neighbor Discovery messages is also |
706 | supported. | |
d31f1109 JP |
707 | |
708 | IPv6 was not designed to interact well with middle-boxes. This, | |
709 | combined with Open vSwitch's stateless nature, has affected the | |
710 | processing of IPv6 traffic, which is detailed below. | |
711 | ||
712 | Extension Headers | |
713 | ----------------- | |
714 | ||
715 | The base IPv6 header is incredibly simple with the intention of only | |
716 | containing information relevant for routing packets between two | |
717 | endpoints. IPv6 relies heavily on the use of extension headers to | |
718 | provide any other functionality. Unfortunately, the extension headers | |
719 | were designed in such a way that it is impossible to move to the next | |
720 | header (including the layer-4 payload) unless the current header is | |
721 | understood. | |
722 | ||
723 | Open vSwitch will process the following extension headers and continue | |
724 | to the next header: | |
725 | ||
542cc9bb TG |
726 | * Fragment (see the next section) |
727 | * AH (Authentication Header) | |
728 | * Hop-by-Hop Options | |
729 | * Routing | |
730 | * Destination Options | |
d31f1109 JP |
731 | |
732 | When a header is encountered that is not in that list, it is considered | |
733 | "terminal". A terminal header's IPv6 protocol value is stored in | |
734 | "nw_proto" for matching purposes. If a terminal header is TCP, UDP, or | |
735 | ICMPv6, the packet will be further processed in an attempt to extract | |
736 | layer-4 information. | |
737 | ||
738 | Fragments | |
739 | --------- | |
740 | ||
741 | IPv6 requires that every link in the internet have an MTU of 1280 octets | |
742 | or greater (RFC 2460). As such, a terminal header (as described above in | |
743 | "Extension Headers") in the first fragment should generally be | |
744 | reachable. In this case, the terminal header's IPv6 protocol type is | |
745 | stored in the "nw_proto" field for matching purposes. If a terminal | |
746 | header cannot be found in the first fragment (one with a fragment offset | |
747 | of zero), the "nw_proto" field is set to 0. Subsequent fragments (those | |
748 | with a non-zero fragment offset) have the "nw_proto" field set to the | |
749 | IPv6 protocol type for fragments (44). | |
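
The rules in "Extension Headers" and "Fragments" can be sketched as a single walk over the header chain. This is an illustration under stated assumptions, not the Open vSwitch parser: the function name `find_nw_proto` is hypothetical, the input is the Next Header value from the base IPv6 header plus the bytes that follow the 40-byte base header, and returning 0 for any chain that runs past the available bytes is a simplification (the text above only specifies 0 for first fragments).

```python
# IPv6 protocol numbers for the extension headers that are skipped.
HOPOPTS, ROUTING, FRAGMENT, AH, DSTOPTS = 0, 43, 44, 51, 60
SKIPPED = (HOPOPTS, ROUTING, FRAGMENT, AH, DSTOPTS)


def find_nw_proto(next_header: int, payload: bytes) -> int:
    """Return the value that would be stored in nw_proto under the rules above."""
    offset = 0
    while next_header in SKIPPED:
        # Every extension header handled here is at least 8 octets long.
        if offset + 8 > len(payload):
            return 0  # terminal header not reachable in these bytes
        following = payload[offset]
        if next_header == FRAGMENT:
            # The fragment offset is the top 13 bits of octets 2-3.
            frag_off = int.from_bytes(payload[offset + 2:offset + 4], "big") >> 3
            if frag_off != 0:
                return FRAGMENT  # later fragment: nw_proto is 44
            length = 8
        elif next_header == AH:
            # AH length is counted in 4-octet units, minus 2.
            length = (payload[offset + 1] + 2) * 4
        else:
            # Other extension headers use 8-octet units, minus 1.
            length = (payload[offset + 1] + 1) * 8
        if offset + length > len(payload):
            return 0
        next_header = following
        offset += length
    return next_header  # terminal header
```

For example, a Hop-by-Hop Options header followed by TCP yields 6, while any fragment with a non-zero offset yields 44.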
750 | ||
751 | Jumbograms | |
752 | ---------- | |
753 | ||
754 | An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer | |
755 | than 65,535 octets. Jumbograms are only relevant in subnets with a link | |
756 | MTU greater than 65,575 octets, and are not required to be supported on | |
757 | nodes that do not connect to links with such large MTUs. Currently, Open | |
758 | vSwitch doesn't process jumbograms. | |
759 | ||
760 | ||
946350dc BP |
761 | In-Band Control |
762 | =============== | |
763 | ||
56e9c3b9 BP |
764 | Motivation |
765 | ---------- | |
766 | ||
767 | An OpenFlow switch must establish and maintain a TCP network | |
768 | connection to its controller. There are two basic ways to categorize | |
769 | the network that this connection traverses: either it is completely | |
770 | separate from the one that the switch is otherwise controlling, or its | |
771 | path may overlap the network that the switch controls. We call the | |
772 | former case "out-of-band control", the latter case "in-band control". | |
773 | ||
774 | Out-of-band control has the following benefits: | |
775 | ||
542cc9bb TG |
776 | - Simplicity: Out-of-band control slightly simplifies the switch |
777 | implementation. | |
56e9c3b9 | 778 | |
542cc9bb TG |
779 | - Reliability: Excessive switch traffic volume cannot interfere |
780 | with control traffic. | |
56e9c3b9 | 781 | |
542cc9bb TG |
782 | - Integrity: Machines not on the control network cannot |
783 | impersonate a switch or a controller. | |
56e9c3b9 | 784 | |
542cc9bb TG |
785 | - Confidentiality: Machines not on the control network cannot |
786 | snoop on control traffic. | |
56e9c3b9 BP |
787 | |
788 | In-band control, on the other hand, has the following advantages: | |
789 | ||
542cc9bb TG |
790 | - No dedicated port: There is no need to dedicate a physical |
791 | switch port to control, which is important on switches that have | |
792 | few ports (e.g. wireless routers, low-end embedded platforms). | |
56e9c3b9 | 793 | |
542cc9bb TG |
794 | - No dedicated network: There is no need to build and maintain a |
795 | separate control network. This is important in many | |
796 | environments because it reduces proliferation of switches and | |
797 | wiring. | |
56e9c3b9 BP |
798 | |
799 | Open vSwitch supports both out-of-band and in-band control. This | |
800 | section describes the principles behind in-band control. See the | |
801 | description of the Controller table in ovs-vswitchd.conf.db(5) to | |
802 | configure OVS for in-band control. | |
803 | ||
804 | Principles | |
805 | ---------- | |
806 | ||
807 | The fundamental principle of in-band control is that an OpenFlow | |
808 | switch must recognize and switch control traffic without involving the | |
809 | OpenFlow controller. All the details of implementing in-band control | |
810 | are special cases of this principle. | |
811 | ||
812 | The rationale for this principle is simple. If the switch does not | |
813 | handle in-band control traffic itself, then it will be caught in a | |
814 | contradiction: it must contact the controller, but it cannot, because | |
815 | only the controller can set up the flows that are needed to contact | |
816 | the controller. | |
817 | ||
818 | The following points describe important special cases of this | |
819 | principle. | |
820 | ||
542cc9bb TG |
821 | - In-band control must be implemented regardless of whether the |
822 | switch is connected. | |
823 | ||
824 | It is tempting to implement the in-band control rules only when | |
825 | the switch is not connected to the controller, using the | |
826 | reasoning that the controller should have complete control once | |
827 | it has established a connection with the switch. | |
828 | ||
829 | This does not work in practice. Consider the case where the | |
830 | switch is connected to the controller. Occasionally it can | |
831 | happen that the controller forgets or otherwise needs to obtain | |
832 | the MAC address of the switch. To do so, the controller sends a | |
833 | broadcast ARP request. A switch that implements the in-band | |
834 | control rules only when it is disconnected will then send an | |
835 | OFPT_PACKET_IN message up to the controller. The controller will | |
836 | be unable to respond, because it does not know the MAC address of | |
837 | the switch. This is a deadlock situation that can only be | |
838 | resolved by the switch noticing that its connection to the | |
839 | controller has hung and reconnecting. | |
840 | ||
841 | - In-band control must override flows set up by the controller. | |
842 | ||
843 | It is reasonable to assume that flows set up by the OpenFlow | |
844 | controller should take precedence over in-band control, on the | |
845 | basis that the controller should be in charge of the switch. | |
846 | ||
847 | Again, this does not work in practice. Reasonable controller | |
848 | implementations may set up a "last resort" fallback rule that | |
849 | wildcards every field and, e.g., sends it up to the controller or | |
850 | discards it. If a controller does that, then it will isolate | |
851 | itself from the switch. | |
852 | ||
853 | - The switch must recognize all control traffic. | |
854 | ||
855 | The fundamental principle of in-band control states, in part, | |
856 | that a switch must recognize control traffic without involving | |
857 | the OpenFlow controller. More specifically, the switch must | |
858 | recognize *all* control traffic. "False negatives", that is, | |
859 | packets that constitute control traffic but that the switch does | |
860 | not recognize as control traffic, lead to control traffic storms. | |
861 | ||
862 | Consider an OpenFlow switch that only recognizes control packets | |
863 | sent to or from that switch. Now suppose that two switches of | |
864 | this type, named A and B, are connected to ports on an Ethernet | |
865 | hub (not a switch) and that an OpenFlow controller is connected | |
866 | to a third hub port. In this setup, control traffic sent by | |
867 | switch A will be seen by switch B, which will send it to the | |
868 | controller as part of an OFPT_PACKET_IN message. Switch A will | |
869 | then see the OFPT_PACKET_IN message's packet, re-encapsulate it | |
870 | in another OFPT_PACKET_IN, and send it to the controller. Switch | |
871 | B will then see that OFPT_PACKET_IN, and so on in an infinite | |
872 | loop. | |
873 | ||
874 | Incidentally, the consequences of "false positives", where | |
875 | packets that are not control traffic are nevertheless recognized | |
876 | as control traffic, are much less severe. The controller will | |
877 | not be able to control their behavior, but the network will | |
878 | remain in working order. False positives do, however, constitute a | |
879 | security problem. | |
880 | ||
881 | - The switch should use echo-requests to detect disconnection. | |
882 | ||
883 | TCP will notice that a connection has hung, but this can take a | |
884 | considerable amount of time. For example, with default settings | |
885 | the Linux kernel TCP implementation will retransmit for between | |
886 | 13 and 30 minutes, depending on the connection's retransmission | |
887 | timeout, according to kernel documentation. This is far too long | |
888 | for a switch to be disconnected, so an OpenFlow switch should | |
889 | implement its own connection timeout. OpenFlow OFPT_ECHO_REQUEST | |
890 | messages are the best way to do this, since they test the | |
891 | OpenFlow connection itself. | |
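
One way to picture the echo-request timeout described in the last point is a small inactivity tracker. This is a sketch under assumptions, not the Open vSwitch connection code: the class name `EchoProbe`, the `probe_interval` parameter, and the choice to disconnect after two idle intervals are all hypothetical.

```python
class EchoProbe:
    """Decide when to send OFPT_ECHO_REQUEST and when to give up."""

    def __init__(self, probe_interval: float, now: float):
        self.probe_interval = probe_interval
        self.last_rx = now          # time the last message was received
        self.probe_sent = False     # an echo request is outstanding

    def on_receive(self, now: float) -> None:
        # Any received message, including an echo reply, proves liveness.
        self.last_rx = now
        self.probe_sent = False

    def tick(self, now: float) -> str:
        idle = now - self.last_rx
        if idle >= 2 * self.probe_interval:
            return "disconnect"
        if idle >= self.probe_interval and not self.probe_sent:
            self.probe_sent = True
            return "send_echo_request"
        return "wait"
```

The caller would invoke `tick()` periodically with a monotonic clock value and act on the returned string; timestamps are passed in explicitly so the logic can be exercised without real delays.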
56e9c3b9 BP |
892 | |
893 | Implementation | |
894 | -------------- | |
895 | ||
896 | This section describes how Open vSwitch implements in-band control. | |
897 | Correctly implementing in-band control has proven difficult due to its | |
898 | many subtleties, and has thus gone through many iterations. Please | |
899 | read through and understand the reasoning behind the chosen rules | |
900 | before making modifications. | |
901 | ||
902 | Open vSwitch implements in-band control as "hidden" flows, that is, | |
903 | flows that are not visible through OpenFlow, and at a higher priority | |
904 | than wildcarded flows can be set up through OpenFlow. This is done so | |
905 | that the OpenFlow controller cannot interfere with them and possibly | |
906 | break connectivity with its switches. It is possible to see all | |
907 | flows, including in-band ones, with the ovs-appctl "bridge/dump-flows" | |
908 | command. | |
946350dc BP |
909 | |
910 | The Open vSwitch implementation of in-band control can hide traffic to | |
911 | arbitrary "remotes", where each remote is one TCP port on one IP address. | |
912 | Currently the remotes are automatically configured as the in-band OpenFlow | |
913 | controllers plus the OVSDB managers, if any. (The latter is a requirement | |
914 | because OVSDB managers are responsible for configuring OpenFlow controllers, | |
915 | so if the manager cannot be reached then OpenFlow cannot be reconfigured.) | |
916 | ||
917 | The following rules (with the OFPP_NORMAL action) are set up on any bridge | |
918 | that has any remotes: | |
919 | ||
920 | (a) DHCP requests sent from the local port. | |
921 | (b) ARP replies to the local port's MAC address. | |
922 | (c) ARP requests from the local port's MAC address. | |
923 | ||
924 | In-band also sets up the following rules for each unique next-hop MAC | |
925 | address for the remotes' IPs (the "next hop" is either the remote | |
926 | itself, if it is on a local subnet, or the gateway to reach the remote): | |
927 | ||
928 | (d) ARP replies to the next hop's MAC address. | |
929 | (e) ARP requests from the next hop's MAC address. | |
930 | ||
931 | In-band also sets up the following rules for each unique remote IP address: | |
932 | ||
933 | (f) ARP replies containing the remote's IP address as a target. | |
934 | (g) ARP requests containing the remote's IP address as a source. | |
935 | ||
936 | In-band also sets up the following rules for each unique remote (IP,port) | |
937 | pair: | |
938 | ||
939 | (h) TCP traffic to the remote's IP and port. | |
940 | (i) TCP traffic from the remote's IP and port. | |
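
The way rules (a) through (i) scale with the set of remotes can be sketched as follows. This is only an illustration of the "for each unique ..." wording above: the `Remote` tuple and `in_band_rules` function are hypothetical, the rules are rendered as descriptive strings instead of real flow matches, and the next-hop MAC is assumed to be already resolved.

```python
from typing import List, NamedTuple


class Remote(NamedTuple):
    ip: str
    port: int
    next_hop_mac: str  # MAC of the remote itself or of the gateway


def in_band_rules(local_mac: str, remotes: List[Remote]) -> List[str]:
    """List the hidden rules, each of which would use the OFPP_NORMAL action."""
    if not remotes:
        return []  # rules exist only on bridges that have remotes
    rules = [
        "(a) DHCP requests sent from the local port",
        f"(b) ARP replies to {local_mac}",
        f"(c) ARP requests from {local_mac}",
    ]
    # One pair of rules per unique next-hop MAC address.
    for mac in sorted({r.next_hop_mac for r in remotes}):
        rules += [f"(d) ARP replies to {mac}",
                  f"(e) ARP requests from {mac}"]
    # One pair of rules per unique remote IP address.
    for ip in sorted({r.ip for r in remotes}):
        rules += [f"(f) ARP replies with target IP {ip}",
                  f"(g) ARP requests with source IP {ip}"]
    # One pair of rules per unique (IP, port) pair.
    for ip, port in sorted({(r.ip, r.port) for r in remotes}):
        rules += [f"(h) TCP to {ip}:{port}",
                  f"(i) TCP from {ip}:{port}"]
    return rules
```

Two remotes that share an IP address and gateway but use different TCP ports therefore add only one extra (h)/(i) pair.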
941 | ||
942 | The goal of these rules is to be as narrow as possible to allow a | |
943 | switch to join a network and be able to communicate with the | |
944 | remotes. As mentioned earlier, these rules have higher priority | |
945 | than the controller's rules, so if they are too broad, they may | |
946 | prevent the controller from implementing its policy. As such, | |
947 | in-band actively monitors some aspects of flow and packet processing | |
948 | so that the rules can be made more precise. | |
949 | ||
950 | In-band control monitors attempts to add flows into the datapath that | |
951 | could interfere with its duties. The datapath only allows exact | |
952 | match entries, so in-band control is able to be very precise about | |
953 | the flows it prevents. Flows that miss in the datapath are sent to | |
954 | userspace to be processed, so preventing these flows from being | |
955 | cached in the "fast path" does not affect correctness. The only type | |
956 | of flow that is currently prevented is one that would prevent DHCP | |
957 | replies from being seen by the local port. For example, a rule that | |
958 | forwarded all DHCP traffic to the controller would not be allowed, | |
959 | but one that forwarded to all ports (including the local port) would. | |
960 | ||
961 | As mentioned earlier, packets that miss in the datapath are sent to | |
962 | the userspace for processing. The userspace has its own flow table, | |
963 | the "classifier", so in-band checks whether any special processing | |
964 | is needed before the classifier is consulted. If a packet is a DHCP | |
965 | response to a request from the local port, the packet is forwarded to | |
966 | the local port, regardless of the flow table. Note that this requires | |
967 | L7 processing of DHCP replies to determine whether the 'chaddr' field | |
968 | matches the MAC address of the local port. | |
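
The 'chaddr' comparison mentioned above can be sketched as below. This is an assumption-laden illustration, not the Open vSwitch code: the function name `is_dhcp_reply_for` is hypothetical, the input is taken to be the BOOTP/DHCP message that follows the UDP header, and only Ethernet hardware addresses are considered. The field offsets follow the fixed BOOTP layout, in which 'chaddr' starts at octet 28.

```python
BOOTREPLY = 2          # value of the 'op' field for replies
HTYPE_ETHERNET = 1
CHADDR_OFFSET = 28     # op, htype, hlen, hops, xid, secs, flags, 4 addresses
CHADDR_LEN = 16


def is_dhcp_reply_for(bootp: bytes, local_mac: bytes) -> bool:
    """Return True if the message is a reply whose chaddr is local_mac."""
    if len(local_mac) != 6 or len(bootp) < CHADDR_OFFSET + CHADDR_LEN:
        return False
    op, htype, hlen = bootp[0], bootp[1], bootp[2]
    if op != BOOTREPLY or htype != HTYPE_ETHERNET or hlen != 6:
        return False
    # Only the first 'hlen' octets of chaddr carry the client address.
    return bootp[CHADDR_OFFSET:CHADDR_OFFSET + 6] == local_mac
```

A packet for which this check succeeds would be forwarded to the local port regardless of the flow table.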
969 | ||
970 | It is interesting to note that for an L3-based in-band control | |
971 | mechanism, the majority of rules are devoted to ARP traffic. At first | |
972 | glance, some of these rules appear redundant. However, each serves an | |
973 | important role. First, in order to determine the MAC address of the | |
974 | remote side (controller or gateway) for other ARP rules, we must allow | |
975 | ARP traffic for our local port with rules (b) and (c). If we are | |
976 | between a switch and its connection to the remote, we have to | |
977 | allow the other switch's ARP traffic through. This is done with | |
978 | rules (d) and (e), since we do not know the addresses of the other | |
979 | switches a priori, but do know the remote's or gateway's. Finally, | |
980 | if the remote is running in a local guest VM that is not reached | |
981 | through the local port, the switch that is connected to the VM must | |
982 | allow ARP traffic based on the remote's IP address, since it will | |
983 | not know the MAC address of the local port that is sending the traffic | |
984 | or the MAC address of the remote in the guest VM. | |
985 | ||
986 | With a few notable exceptions below, in-band should work in most | |
1c38055d | 987 | network setups. The following are considered "supported" in the |
946350dc BP |
988 | current implementation: |
989 | ||
542cc9bb TG |
990 | - Locally Connected. The switch and remote are on the same |
991 | subnet. This uses rules (a), (b), (c), (h), and (i). | |
992 | ||
993 | - Reached through Gateway. The switch and remote are on | |
994 | different subnets and must go through a gateway. This uses | |
995 | rules (a), (b), (c), (h), and (i). | |
996 | ||
997 | - Between Switch and Remote. This switch is between another | |
998 | switch and the remote, and we want to allow the other | |
999 | switch's traffic through. This uses rules (d), (e), (h), and | |
1000 | (i). It uses (b) and (c) indirectly in order to know the MAC | |
1001 | address for rules (d) and (e). Note that DHCP for the other | |
1002 | switch will not work unless an OpenFlow controller explicitly lets this | |
1003 | switch pass the traffic. | |
1004 | ||
1005 | - Between Switch and Gateway. This switch is between another | |
1006 | switch and the gateway, and we want to allow the other switch's | |
1007 | traffic through. This uses the same rules and logic as the | |
1008 | "Between Switch and Remote" configuration described earlier. | |
1009 | ||
1010 | - Remote on Local VM. The remote is a guest VM on the | |
1011 | system running in-band control. This uses rules (a), (b), (c), | |
1012 | (h), and (i). | |
1013 | ||
1014 | - Remote on Local VM with Different Networks. The remote | |
1015 | is a guest VM on the system running in-band control, but the | |
1016 | local port is not used to connect to the remote. For | |
1017 | example, an IP address is configured on eth0 of the switch. The | |
1018 | remote's VM is connected through eth1 of the switch, but an | |
1019 | IP address has not been configured for that port on the switch. | |
1020 | As such, the switch will use eth0 to connect to the remote, | |
1021 | and eth1's rules about the local port will not work. In the | |
1022 | example, the switch attached to eth0 would use rules (a), (b), | |
1023 | (c), (h), and (i) on eth0. The switch attached to eth1 would use | |
1024 | rules (f), (g), (h), and (i). | |
946350dc BP |
1025 | |
1026 | The following are explicitly *not* supported by in-band control: | |
1027 | ||
542cc9bb TG |
1028 | - Specify Remote by Name. Currently, the remote must be |
1029 | identified by IP address. A naive approach would be to permit | |
1030 | all DNS traffic. Unfortunately, this would prevent the | |
1031 | controller from defining any policy over DNS. Since switches | |
1032 | that are located behind us need to connect to the remote, | |
1033 | in-band cannot simply add a rule that allows DNS traffic from | |
1034 | the local port. The "correct" way to support this is to parse | |
1035 | DNS requests to allow all traffic related to a request for the | |
1036 | remote's name through. Due to the potential security | |
1037 | problems and amount of processing, we decided to hold off for | |
1038 | the time being. | |
1039 | ||
1040 | - Differing Remotes for Switches. All switches must know | |
1041 | the L3 addresses for all the remotes that other switches | |
1042 | may use, since rules need to be set up to allow traffic related | |
1043 | to those remotes through. See rules (f), (g), (h), and (i). | |
1044 | ||
1045 | - Differing Routes for Switches. In order for the switch to | |
1046 | allow other switches to connect to a remote through a | |
1047 | gateway, it allows the gateway's traffic through with rules (d) | |
1048 | and (e). If the routes to the remote differ for the two | |
1049 | switches, we will not know the MAC address of the alternate | |
1050 | gateway. | |
946350dc BP |
1051 | |
1052 | ||
f25d0cf3 BP |
1053 | Action Reproduction |
1054 | =================== | |
1055 | ||
1056 | It seems likely that many controllers, at least at startup, use the | |
1057 | OpenFlow "flow statistics" request to obtain existing flows, then | |
1058 | compare the flows' actions against the actions that they expect to | |
1059 | find. Before version 1.8.0, Open vSwitch always returned exact, | |
1060 | byte-for-byte copies of the actions that had been added to the flow | |
1061 | table. The current version of Open vSwitch does not always do this in | |
1062 | some exceptional cases. This section lists the exceptions that | |
1063 | controller authors must keep in mind if they compare actual actions | |
1064 | against desired actions in a bytewise fashion: | |
1065 | ||
542cc9bb TG |
1066 | - Open vSwitch zeros padding bytes in action structures, |
1067 | regardless of their values when the flows were added. | |
f25d0cf3 | 1068 | |
542cc9bb TG |
1069 | - Open vSwitch "normalizes" the instructions in OpenFlow 1.1 |
1070 | (and later) in the following way: | |
d01c980f | 1071 | |
542cc9bb TG |
1072 | * OVS sorts the instructions into the following order: |
1073 | Apply-Actions, Clear-Actions, Write-Actions, | |
1074 | Write-Metadata, Goto-Table. | |
d01c980f | 1075 | |
542cc9bb TG |
1076 | * OVS drops Apply-Actions instructions that have empty |
1077 | action lists. | |
d01c980f | 1078 | |
542cc9bb TG |
1079 | * OVS drops Write-Actions instructions that have empty |
1080 | action sets. | |
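
A controller that compares flows could apply the same normalization to its expected instructions before comparing. The sketch below is a hypothetical controller-side helper, not an Open vSwitch API: instructions are modeled as `(name, payload)` tuples, the names are invented for this example, and only the five instruction types listed above are handled.

```python
# Order in which the instructions are reported, per the list above.
ORDER = ["apply_actions", "clear_actions", "write_actions",
         "write_metadata", "goto_table"]


def normalize(instructions):
    """Drop empty Apply-Actions/Write-Actions and sort the remainder."""
    kept = [
        (name, payload)
        for name, payload in instructions
        # Only these two instruction types are dropped when empty.
        if not (name in ("apply_actions", "write_actions") and len(payload) == 0)
    ]
    # ORDER.index raises ValueError for instruction types not listed above.
    return sorted(kept, key=lambda ins: ORDER.index(ins[0]))
```

Comparing `normalize(expected)` with the parsed reply avoids false mismatches caused by reordering or by dropped empty instructions; padding bytes would still need to be zeroed separately in a bytewise comparison.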
d01c980f | 1081 | |
f25d0cf3 BP |
1082 | Please report other discrepancies, if you notice any, so that we can |
1083 | fix or document them. | |
1084 | ||
1085 | ||
d31f1109 JP |
1086 | Suggestions |
1087 | =========== | |
1088 | ||
1089 | Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org. |