Commit | Line | Data |
---|---|---|
542cc9bb TG |
1 | Design Decisions In Open vSwitch |
2 | ================================ | |
d31f1109 JP |
3 | |
4 | This document describes design decisions that went into implementing | |
5 | Open vSwitch. While we believe these to be reasonable decisions, it is | |
6 | impossible to predict how Open vSwitch will be used in all environments. | |
7 | Understanding assumptions made by Open vSwitch is critical to a | |
8 | successful deployment. The end of this document contains contact | |
9 | information that can be used to let us know how we can make Open vSwitch | |
10 | more generally useful. | |
11 | ||
80d5aefd BP |
12 | Asynchronous Messages |
13 | ===================== | |
14 | ||
15 | Over time, Open vSwitch has added many knobs that control whether a | |
16 | given controller receives OpenFlow asynchronous messages. This | |
17 | section describes how all of these features interact. | |
18 | ||
19 | First, a service controller never receives any asynchronous messages | |
4550b647 MM |
20 | unless it changes its miss_send_len from the service controller |
21 | default of zero in one of the following ways: | |
22 | ||
542cc9bb | 23 | - Sending an OFPT_SET_CONFIG message with nonzero miss_send_len. |
4550b647 | 24 | |
542cc9bb TG |
25 | - Sending any NXT_SET_ASYNC_CONFIG message: as a side effect, this |
26 | message changes the miss_send_len to | |
27 | OFP_DEFAULT_MISS_SEND_LEN (128) for service controllers. | |
80d5aefd BP |
28 | |
29 | Second, OFPT_FLOW_REMOVED and NXT_FLOW_REMOVED messages are generated | |
30 | only if the flow that was removed had the OFPFF_SEND_FLOW_REM flag | |
31 | set. | |
32 | ||
a7349929 BP |
33 | Third, OFPT_PACKET_IN and NXT_PACKET_IN messages are sent only to |
34 | OpenFlow controller connections that have the correct connection ID | |
35 | (see "struct nx_controller_id" and "struct nx_action_controller"): | |
36 | ||
542cc9bb TG |
37 | - For packet-in messages generated by an NXAST_CONTROLLER action, |
38 | the controller ID specified in the action. | |
a7349929 | 39 | |
542cc9bb TG |
40 | - For other packet-in messages, controller ID zero. (This is the |
41 | default ID when an OpenFlow controller does not configure one.) | |
a7349929 | 42 | |
80d5aefd BP |
43 | Finally, Open vSwitch consults a per-connection table indexed by the |
44 | message type, reason code, and current role. The following table | |
45 | shows how this table is initialized by default when an OpenFlow | |
46 | connection is made. An entry labeled "yes" means that the message is | |
47 | sent; an entry labeled "---" means that the message is suppressed. | |
48 | ||
542cc9bb | 49 | ``` |
80d5aefd BP |
50 | master/ |
51 | message and reason code other slave | |
52 | ---------------------------------------- ------- ----- | |
53 | OFPT_PACKET_IN / NXT_PACKET_IN | |
54 | OFPR_NO_MATCH yes --- | |
55 | OFPR_ACTION yes --- | |
56 | OFPR_INVALID_TTL --- --- | |
029ca940 | 57 | OFPR_ACTION_SET (OF1.4+) yes --- |
3a11fd5b | 58 | OFPR_GROUP (OF1.4+) yes --- |
80d5aefd BP |
59 | |
60 | OFPT_FLOW_REMOVED / NXT_FLOW_REMOVED | |
61 | OFPRR_IDLE_TIMEOUT yes --- | |
62 | OFPRR_HARD_TIMEOUT yes --- | |
63 | OFPRR_DELETE yes --- | |
98090482 NR |
64 | OFPRR_GROUP_DELETE (OF1.4+) yes --- |
65 | OFPRR_METER_DELETE (OF1.4+) yes --- | |
66 | OFPRR_EVICTION (OF1.4+) yes --- | |
80d5aefd BP |
67 | |
68 | OFPT_PORT_STATUS | |
69 | OFPPR_ADD yes yes | |
70 | OFPPR_DELETE yes yes | |
71 | OFPPR_MODIFY yes yes | |
98090482 NR |
72 | |
73 | OFPT_ROLE_REQUEST / OFPT_ROLE_REPLY (OF1.4+) | |
74 | OFPCRR_MASTER_REQUEST --- --- | |
75 | OFPCRR_CONFIG --- --- | |
76 | OFPCRR_EXPERIMENTER --- --- | |
77 | ||
78 | OFPT_TABLE_STATUS (OF1.4+) | |
79 | OFPTR_VACANCY_DOWN --- --- | |
80 | OFPTR_VACANCY_UP --- --- | |
81 | ||
82 | OFPT_REQUESTFORWARD (OF1.4+) | |
83 | OFPRFR_GROUP_MOD --- --- | |
84 | OFPRFR_METER_MOD --- --- | |
542cc9bb | 85 | ``` |
80d5aefd BP |
86 | |
87 | The NXT_SET_ASYNC_CONFIG message directly sets all of the values in | |
88 | this table for the current connection. The | |
89 | OFPC_INVALID_TTL_TO_CONTROLLER bit in the OFPT_SET_CONFIG message | |
90 | controls the setting for OFPR_INVALID_TTL for the "master" role. | |
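The default entries for the three message types that exist in every OpenFlow version can be expressed as a small lookup function. This is only a sketch of the table above, not the Open vSwitch implementation: the enum and function names are hypothetical, the "master" and "other" roles are folded into one value because the table gives them the same defaults, and the OF1.4-only message types are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; these are not OVS identifiers. */
enum sketch_role { SKETCH_ROLE_MASTER_OR_OTHER, SKETCH_ROLE_SLAVE };
enum sketch_msg { SKETCH_PACKET_IN, SKETCH_FLOW_REMOVED, SKETCH_PORT_STATUS };

/* OpenFlow packet-in reason codes (OFPR_*). */
enum { OFPR_NO_MATCH = 0, OFPR_ACTION = 1, OFPR_INVALID_TTL = 2 };

/* Returns the default table entry ("yes" == true, "---" == false) for a
 * newly made connection, before any NXT_SET_ASYNC_CONFIG or
 * OFPT_SET_CONFIG message changes it. */
bool
default_async_enabled(enum sketch_msg msg, int reason, enum sketch_role role)
{
    assert(reason >= 0);

    switch (msg) {
    case SKETCH_PORT_STATUS:
        /* Port status is sent to every role, for every reason. */
        return true;
    case SKETCH_FLOW_REMOVED:
        /* Every listed reason is sent, but never to a slave. */
        return role != SKETCH_ROLE_SLAVE;
    case SKETCH_PACKET_IN:
        /* Never sent to a slave; OFPR_INVALID_TTL is off by default. */
        return role != SKETCH_ROLE_SLAVE && reason != OFPR_INVALID_TTL;
    }
    return false;
}
```

A real connection would copy these defaults into its own table and then let NXT_SET_ASYNC_CONFIG overwrite individual entries.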
91 | ||
92 | ||
93 | OFPAT_ENQUEUE | |
94 | ============= | |
82172632 EJ |
95 | |
96 | The OpenFlow 1.0 specification requires the output port of the OFPAT_ENQUEUE | |
97 | action to "refer to a valid physical port (i.e. < OFPP_MAX) or OFPP_IN_PORT". | |
98 | Although OFPP_LOCAL is not less than OFPP_MAX, it is an 'internal' port which | |
99 | can have QoS applied to it in Linux. Since we allow the OFPAT_ENQUEUE to apply | |
100 | to 'internal' ports whose port numbers are less than OFPP_MAX, we interpret | |
101 | OFPP_LOCAL as a physical port and support OFPAT_ENQUEUE on it as well. | |
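The resulting acceptance rule can be sketched as a single predicate. The port constants are the OpenFlow 1.0 values; the function name `enqueue_port_is_valid` is hypothetical and is not taken from the Open vSwitch source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* OpenFlow 1.0 reserved port numbers. */
#define OFPP_MAX     0xff00
#define OFPP_IN_PORT 0xfff8
#define OFPP_LOCAL   0xfffe

/* OFPP_LOCAL is not below OFPP_MAX, so it needs its own case. */
static_assert(OFPP_LOCAL >= OFPP_MAX, "OFPP_LOCAL is a reserved port");

/* Returns true if 'port' is acceptable as the output port of an
 * OFPAT_ENQUEUE action under the interpretation described above. */
bool
enqueue_port_is_valid(uint16_t port)
{
    return port < OFPP_MAX || port == OFPP_IN_PORT || port == OFPP_LOCAL;
}
```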
102 | ||
d31f1109 | 103 | |
12442ec5 BP |
104 | OFPT_FLOW_MOD |
105 | ============= | |
106 | ||
3432cb4e BP |
107 | The OpenFlow specification for the behavior of OFPT_FLOW_MOD is |
108 | confusing. The following tables summarize the Open vSwitch | |
12442ec5 BP |
109 | implementation of its behavior in the following categories: |
110 | ||
542cc9bb TG |
111 | - "match on priority": Whether the flow_mod acts only on flows |
112 | whose priority matches that included in the flow_mod message. | |
12442ec5 | 113 | |
542cc9bb TG |
114 | - "match on out_port": Whether the flow_mod acts only on flows |
115 | that output to the out_port included in the flow_mod message (if | |
116 | out_port is not OFPP_NONE). OpenFlow 1.1 and later have a | |
117 | similar feature (not listed separately here) for out_group. | |
3432cb4e | 118 | |
542cc9bb TG |
119 | - "match on flow_cookie": Whether the flow_mod acts only on flows |
120 | whose flow_cookie matches an optional controller-specified value | |
121 | and mask. | |
12442ec5 | 122 | |
542cc9bb TG |
123 | - "updates flow_cookie": Whether the flow_mod changes the |
124 | flow_cookie of the flow or flows that it matches to the | |
125 | flow_cookie included in the flow_mod message. | |
12442ec5 | 126 | |
542cc9bb TG |
127 | - "updates OFPFF_SEND_FLOW_REM": Whether the flow_mod changes the |
128 | OFPFF_SEND_FLOW_REM flag of the flow or flows that it matches to | |
129 | the setting included in the flags of the flow_mod message. | |
12442ec5 | 130 | |
542cc9bb TG |
131 | - "honors OFPFF_CHECK_OVERLAP": Whether the OFPFF_CHECK_OVERLAP |
132 | flag in the flow_mod is significant. | |
12442ec5 | 133 | |
542cc9bb TG |
134 | - "updates idle_timeout" and "updates hard_timeout": Whether the |
135 | idle_timeout and hard_timeout in the flow_mod, respectively, | |
136 | have an effect on the flow or flows matched by the flow_mod. | |
12442ec5 | 137 | |
542cc9bb TG |
138 | - "resets idle timer": Whether the flow_mod resets the per-flow |
139 | timer that measures how long a flow has been idle. | |
12442ec5 | 140 | |
542cc9bb TG |
141 | - "resets hard timer": Whether the flow_mod resets the per-flow |
142 | timer that measures how long it has been since a flow was | |
143 | modified. | |
12442ec5 | 144 | |
542cc9bb TG |
145 | - "zeros counters": Whether the flow_mod resets per-flow packet |
146 | and byte counters to zero. | |
12442ec5 | 147 | |
542cc9bb TG |
148 | - "may add a new flow": Whether the flow_mod may add a new flow to |
149 | the flow table. (Obviously this is always true for "add" | |
150 | commands but in some OpenFlow versions "modify" and | |
151 | "modify-strict" can also add new flows.) | |
3432cb4e | 152 | |
542cc9bb TG |
153 | - "sends flow_removed message": Whether the flow_mod generates a |
154 | flow_removed message for the flow or flows that it affects. | |
12442ec5 BP |
155 | |
156 | An entry labeled "yes" means that the flow mod type does have the | |
157 | indicated behavior, "---" means that it does not, an empty cell means | |
158 | that the property is not applicable, and other values are explained | |
159 | below the table. | |
160 | ||
3432cb4e BP |
161 | OpenFlow 1.0 |
162 | ------------ | |
163 | ||
542cc9bb | 164 | ``` |
12442ec5 BP |
165 | MODIFY DELETE |
166 | ADD MODIFY STRICT DELETE STRICT | |
167 | === ====== ====== ====== ====== | |
3432cb4e | 168 | match on priority yes --- yes --- yes |
906087ee | 169 | match on out_port --- --- --- yes yes |
3432cb4e BP |
170 | match on flow_cookie --- --- --- --- --- |
171 | match on table_id --- --- --- --- --- | |
172 | controller chooses table_id --- --- --- | |
12442ec5 BP |
173 | updates flow_cookie yes yes yes |
174 | updates OFPFF_SEND_FLOW_REM yes + + | |
175 | honors OFPFF_CHECK_OVERLAP yes + + | |
176 | updates idle_timeout yes + + | |
177 | updates hard_timeout yes + + | |
178 | resets idle timer yes + + | |
179 | resets hard timer yes yes yes | |
180 | zeros counters yes + + | |
3432cb4e BP |
181 | may add a new flow yes yes yes |
182 | sends flow_removed message --- --- --- % % | |
183 | ||
184 | (+) "modify" and "modify-strict" only take these actions when they | |
185 | create a new flow, not when they update an existing flow. | |
186 | ||
187 | (%) "delete" and "delete_strict" generate a flow_removed message if | |
188 | the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set. | |
189 | (Each controller can separately control whether it wants to | |
190 | receive the generated messages.) | |
542cc9bb | 191 | ``` |
3432cb4e BP |
192 | |
193 | OpenFlow 1.1 | |
194 | ------------ | |
195 | ||
196 | OpenFlow 1.1 makes these changes: | |
197 | ||
542cc9bb TG |
198 | - The controller now must specify the table_id of the flow match |
199 | searched and into which a flow may be inserted. Behavior for a | |
200 | table_id of 255 is undefined. | |
3432cb4e | 201 | |
542cc9bb | 202 | - A flow_mod, except an "add", can now match on the flow_cookie. |
3432cb4e | 203 | |
542cc9bb TG |
204 | - When a flow_mod matches on the flow_cookie, "modify" and |
205 | "modify-strict" never insert a new flow. | |
3432cb4e | 206 | |
542cc9bb | 207 | ``` |
3432cb4e BP |
208 | MODIFY DELETE |
209 | ADD MODIFY STRICT DELETE STRICT | |
210 | === ====== ====== ====== ====== | |
211 | match on priority yes --- yes --- yes | |
212 | match on out_port --- --- --- yes yes | |
213 | match on flow_cookie --- yes yes yes yes | |
214 | match on table_id yes yes yes yes yes | |
215 | controller chooses table_id yes yes yes | |
216 | updates flow_cookie yes --- --- | |
217 | updates OFPFF_SEND_FLOW_REM yes + + | |
218 | honors OFPFF_CHECK_OVERLAP yes + + | |
219 | updates idle_timeout yes + + | |
220 | updates hard_timeout yes + + | |
221 | resets idle timer yes + + | |
222 | resets hard timer yes yes yes | |
223 | zeros counters yes + + | |
224 | may add a new flow yes # # | |
12442ec5 BP |
225 | sends flow_removed message --- --- --- % % |
226 | ||
227 | (+) "modify" and "modify-strict" only take these actions when they | |
228 | create a new flow, not when they update an existing flow. | |
229 | ||
230 | (%) "delete" and "delete_strict" generate a flow_removed message if | |
231 | the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set. | |
232 | (Each controller can separately control whether it wants to | |
233 | receive the generated messages.) | |
234 | ||
3432cb4e BP |
235 | (#) "modify" and "modify-strict" only add a new flow if the flow_mod |
236 | does not match on any bits of the flow cookie. | |
542cc9bb | 237 | ``` |
3432cb4e BP |
238 | |
239 | OpenFlow 1.2 | |
240 | ------------ | |
241 | ||
242 | OpenFlow 1.2 makes these changes: | |
243 | ||
542cc9bb TG |
244 | - Only "add" commands ever add flows; "modify" and "modify-strict" |
245 | never do. | |
3432cb4e | 246 | |
542cc9bb TG |
247 | - A new flag OFPFF_RESET_COUNTS now controls whether "modify" and |
248 | "modify-strict" reset counters, whereas previously they never | |
249 | reset counters (except when they inserted a new flow). | |
3432cb4e | 250 | |
542cc9bb | 251 | ``` |
3432cb4e BP |
252 | MODIFY DELETE |
253 | ADD MODIFY STRICT DELETE STRICT | |
254 | === ====== ====== ====== ====== | |
255 | match on priority yes --- yes --- yes | |
256 | match on out_port --- --- --- yes yes | |
257 | match on flow_cookie --- yes yes yes yes | |
258 | match on table_id yes yes yes yes yes | |
259 | controller chooses table_id yes yes yes | |
260 | updates flow_cookie yes --- --- | |
261 | updates OFPFF_SEND_FLOW_REM yes --- --- | |
262 | honors OFPFF_CHECK_OVERLAP yes --- --- | |
263 | updates idle_timeout yes --- --- | |
264 | updates hard_timeout yes --- --- | |
265 | resets idle timer yes --- --- | |
266 | resets hard timer yes yes yes | |
267 | zeros counters yes & & | |
268 | may add a new flow yes --- --- | |
269 | sends flow_removed message --- --- --- % % | |
270 | ||
271 | (%) "delete" and "delete_strict" generate a flow_removed message if | |
272 | the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set. | |
273 | (Each controller can separately control whether it wants to | |
274 | receive the generated messages.) | |
275 | ||
276 | (&) "modify" and "modify-strict" reset counters if the | |
277 | OFPFF_RESET_COUNTS flag is specified. | |
542cc9bb | 278 | ``` |
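The "(&)" entry can be illustrated with a few lines of C. This is a sketch only: `struct sketch_flow` and `sketch_modify_counters` are hypothetical names, and the flag value is the one assigned to OFPFF_RESET_COUNTS in OpenFlow 1.2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFPFF_RESET_COUNTS (1 << 2)   /* OpenFlow 1.2+ flow_mod flag. */

/* Hypothetical per-flow statistics for this sketch. */
struct sketch_flow {
    uint64_t packets;
    uint64_t bytes;
};

/* Applies the counter rule for an OF1.2+ "modify" or "modify-strict"
 * that matched 'flow': counters are zeroed only when the flow_mod
 * carries OFPFF_RESET_COUNTS, otherwise they are left alone. */
void
sketch_modify_counters(struct sketch_flow *flow, uint16_t flags)
{
    assert(flow != NULL);

    if (flags & OFPFF_RESET_COUNTS) {
        flow->packets = 0;
        flow->bytes = 0;
    }
}
```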
3432cb4e BP |
279 | |
280 | OpenFlow 1.3 | |
281 | ------------ | |
282 | ||
283 | OpenFlow 1.3 makes these changes: | |
284 | ||
542cc9bb TG |
285 | - Behavior for a table_id of 255 is now defined, for "delete" and |
286 | "delete-strict" commands, as meaning to delete from all tables. | |
287 | A table_id of 255 is now explicitly invalid for other commands. | |
3432cb4e | 288 | |
542cc9bb TG |
289 | - New flags OFPFF_NO_PKT_COUNTS and OFPFF_NO_BYT_COUNTS for "add" |
290 | operations. | |
3432cb4e BP |
291 | |
292 | The table for 1.3 is the same as the one shown above for 1.2. | |
293 | ||
12442ec5 | 294 | |
c37c0382 | 295 | OpenFlow 1.4 |
82c22d34 BP |
296 | ------------ |
297 | ||
298 | OpenFlow 1.4 makes these changes: | |
299 | ||
300 | - Adds the "importance" field to flow_mods, but it does not | |
301 | explicitly specify which kinds of flow_mods set the importance. | |
302 | For consistency, Open vSwitch uses the same rule for importance | |
303 | as for idle_timeout and hard_timeout, that is, only an "ADD" | |
304 | flow_mod sets the importance. (This issue has been filed with | |
305 | the ONF as EXT-496.) | |
c37c0382 | 306 | |
82c22d34 BP |
307 | - Eviction Mechanism to automatically delete entries of lower |
308 | importance to make space for newer entries. | |
c37c0382 | 309 | |
1c38055d JR |
310 | |
311 | OpenFlow 1.4 Bundles | |
312 | ==================== | |
313 | ||
314 | Open vSwitch makes all flow table modifications atomically, i.e., any | |
315 | datapath packet only sees flow table configurations either before or | |
316 | after any change made by any flow_mod. For example, if a controller | |
317 | removes all flows with a single OpenFlow "flow_mod", no packet sees an | |
318 | intermediate version of the OpenFlow pipeline where only some of the | |
319 | flows have been deleted. | |
320 | ||
321 | It should be noted that Open vSwitch caches datapath flows, and that | |
322 | the cached flows are NOT flushed immediately when a flow table | |
323 | changes. Instead, the datapath flows are revalidated against the new | |
324 | flow table as soon as possible, and usually within one second of the | |
325 | modification. This design amortizes the cost of datapath cache | |
326 | flushing across multiple flow table changes, and has a significant | |
327 | performance effect during simultaneous heavy flow table churn and high | |
328 | traffic load. This means that different cached datapath flows may | |
329 | have been computed based on different flow table configurations, but | |
330 | each of the datapath flows is guaranteed to have been computed over a | |
331 | coherent view of the flow tables, as described above. | |
332 | ||
333 | With OpenFlow 1.4 bundles this atomicity can be extended across an | |
334 | arbitrary set of flow_mods. Bundles are supported for flow_mod and | |
335 | port_mod messages only. For flow_mods, both 'atomic' and 'ordered' | |
336 | bundle flags are trivially supported, as all bundled messages are | |
337 | executed in the order they were added and all flow table modifications | |
338 | are now atomic to the datapath. Port mods may not appear in atomic | |
339 | bundles, as port status modifications are not atomic. | |
340 | ||
341 | To support bundles, ovs-ofctl has a '--bundle' option that makes the | |
342 | flow mod commands ('add-flow', 'add-flows', 'mod-flows', 'del-flows', | |
343 | and 'replace-flows') use an OpenFlow 1.4 bundle to apply the | |
344 | modifications as a single atomic transaction. If any of the flow mods | |
345 | in a transaction fail, none of them are executed. All flow mods in a | |
346 | bundle appear to datapath lookups simultaneously. | |
347 | ||
348 | Furthermore, ovs-ofctl 'add-flow' and 'add-flows' commands now accept | |
349 | arbitrary flow mods as an input by allowing the flow specification to | |
350 | start with an explicit 'add', 'modify', 'modify_strict', 'delete', or | |
351 | 'delete_strict' keyword. A missing keyword is treated as 'add', so | |
352 | this is fully backwards compatible. With the new '--bundle' option | |
353 | all the flow mods are executed as a single atomic transaction using an | |
354 | OpenFlow 1.4 bundle. Without the '--bundle' option the flow mods are | |
355 | executed in order up to the first failing flow_mod, and in case of an | |
356 | error the earlier successful flow_mods are not rolled back. | |
357 | ||
358 | ||
4d197ebb BP |
359 | OFPT_PACKET_IN |
360 | ============== | |
361 | ||
362 | The OpenFlow 1.1 specification for OFPT_PACKET_IN is confusing. The | |
363 | definition in OF1.1 openflow.h is[*]: | |
364 | ||
542cc9bb | 365 | ``` |
4d197ebb BP |
366 | /* Packet received on port (datapath -> controller). */ |
367 | struct ofp_packet_in { | |
368 | struct ofp_header header; | |
369 | uint32_t buffer_id; /* ID assigned by datapath. */ | |
370 | uint32_t in_port; /* Port on which frame was received. */ | |
371 | uint32_t in_phy_port; /* Physical Port on which frame was received. */ | |
372 | uint16_t total_len; /* Full length of frame. */ | |
373 | uint8_t reason; /* Reason packet is being sent (one of OFPR_*) */ | |
374 | uint8_t table_id; /* ID of the table that was looked up */ | |
375 | uint8_t data[0]; /* Ethernet frame, halfway through 32-bit word, | |
376 | so the IP header is 32-bit aligned. The | |
377 | amount of data is inferred from the length | |
378 | field in the header. Because of padding, | |
379 | offsetof(struct ofp_packet_in, data) == | |
380 | sizeof(struct ofp_packet_in) - 2. */ | |
381 | }; | |
382 | OFP_ASSERT(sizeof(struct ofp_packet_in) == 24); | |
542cc9bb | 383 | ``` |
4d197ebb BP |
384 | |
385 | The confusing part is the comment on the data[] member. This comment | |
386 | is a leftover from OF1.0 openflow.h, in which the comment was correct: | |
387 | sizeof(struct ofp_packet_in) is 20 in OF1.0 and offsetof(struct | |
388 | ofp_packet_in, data) is 18. When OF1.1 was written, the structure | |
389 | members were changed but the comment was carelessly not updated, and | |
390 | the comment became wrong: sizeof(struct ofp_packet_in) and | |
391 | offsetof(struct ofp_packet_in, data) are both 24 in OF1.1. | |
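These sizes can be checked mechanically. The sketch below declares both layouts under hypothetical names (`of10_packet_in`, `of11_packet_in`), uses a C99 flexible array member in place of `data[0]`, and assumes a typical ABI in which `uint32_t` is 4-byte aligned; the OF1.0 member list is reproduced from the OF1.0 openflow.h rather than from this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ofp_header {
    uint8_t version;
    uint8_t type;
    uint16_t length;
    uint32_t xid;
};

/* OF1.0 layout: 18 bytes of members, so 'data' starts halfway through a
 * 32-bit word and the structure is padded out to 20 bytes. */
struct of10_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint16_t total_len;
    uint16_t in_port;
    uint8_t reason;
    uint8_t pad;
    uint8_t data[];
};

/* OF1.1 layout: 24 bytes of members, so there is no trailing padding. */
struct of11_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint32_t in_port;
    uint32_t in_phy_port;
    uint16_t total_len;
    uint8_t reason;
    uint8_t table_id;
    uint8_t data[];
};

static_assert(sizeof(struct of10_packet_in) == 20, "OF1.0 size");
static_assert(offsetof(struct of10_packet_in, data) == 18, "OF1.0 data");
static_assert(sizeof(struct of11_packet_in) == 24, "OF1.1 size");
static_assert(offsetof(struct of11_packet_in, data) == 24, "OF1.1 data");
```

The "sizeof - 2" relationship in the comment therefore holds only for the OF1.0 layout.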
392 | ||
393 | That leaves the question of how to implement ofp_packet_in in OF1.1. | |
394 | The OpenFlow reference implementation for OF1.1 does not include any | |
395 | padding, that is, the first byte of the encapsulated frame immediately | |
396 | follows the 'table_id' member without a gap. Open vSwitch therefore | |
397 | implements it the same way for compatibility. | |
398 | ||
399 | For an earlier discussion, please see the thread archived at: | |
400 | https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html | |
401 | ||
402 | [*] The quoted definition is directly from OF1.1. Definitions used | |
403 | inside OVS omit the 8-byte ofp_header members, so the sizes in | |
404 | this discussion are 8 bytes larger than those declared in OVS | |
405 | header files. | |
406 | ||
407 | ||
df778240 BP |
408 | VLAN Matching |
409 | ============= | |
410 | ||
411 | The 802.1Q VLAN header causes more trouble than any other 4 bytes in | |
412 | networking. More specifically, three versions of OpenFlow and Open | |
413 | vSwitch have among them four different ways to match the contents and | |
414 | presence of the VLAN header. The following table describes how each | |
415 | version works. | |
416 | ||
417 | Match NXM OF1.0 OF1.1 OF1.2 | |
418 | ----- --------- ----------- ----------- ------------ | |
419 | [1] 0000/0000 ????/1,??/? ????/1,??/? 0000/0000,-- | |
420 | [2] 0000/ffff ffff/0,??/? ffff/0,??/? 0000/ffff,-- | |
421 | [3] 1xxx/1fff 0xxx/0,??/1 0xxx/0,??/1 1xxx/ffff,-- | |
422 | [4] z000/f000 ????/1,0y/0 fffe/0,0y/0 1000/1000,0y | |
423 | [5] zxxx/ffff 0xxx/0,0y/0 0xxx/0,0y/0 1xxx/ffff,0y | |
424 | [6] 0000/0fff <none> <none> <none> | |
425 | [7] 0000/f000 <none> <none> <none> | |
426 | [8] 0000/efff <none> <none> <none> | |
427 | [9] 1001/1001 <none> <none> 1001/1001,-- | |
428 | [10] 3000/3000 <none> <none> <none> | |
5fec03b1 | 429 | [11] 1000/1000 <none> fffe/0,??/1 1000/1000,-- |
df778240 BP |
430 | |
431 | Each column is interpreted as follows. | |
432 | ||
542cc9bb | 433 | - Match: See the list below. |
df778240 | 434 | |
542cc9bb TG |
435 | - NXM: xxxx/yyyy means NXM_OF_VLAN_TCI_W with value xxxx and mask |
436 | yyyy. A mask of 0000 is equivalent to omitting | |
437 | NXM_OF_VLAN_TCI(_W), a mask of ffff is equivalent to | |
438 | NXM_OF_VLAN_TCI. | |
df778240 | 439 | |
053df7bd BP |
440 | - OF1.0 and OF1.1: wwww/x,yy/z means dl_vlan wwww, OFPFW_DL_VLAN x, |
441 | dl_vlan_pcp yy, and OFPFW_DL_VLAN_PCP z. If OFPFW_DL_VLAN or | |
442 | OFPFW_DL_VLAN_PCP is 1, the corresponding field value is | |
443 | wildcarded, otherwise it is matched. ? means that the given bits | |
444 | are ignored (their conventional values are 0000/x,00/0 in OF1.0, | |
445 | 0000/x,00/1 in OF1.1; x is never ignored). <none> means that the | |
446 | given match is not supported. | |
df778240 | 447 | |
542cc9bb TG |
448 | - OF1.2: xxxx/yyyy,zz means OXM_OF_VLAN_VID_W with value xxxx and |
449 | mask yyyy, and OXM_OF_VLAN_PCP (which is not maskable) with | |
450 | value zz. A mask of 0000 is equivalent to omitting | |
451 | OXM_OF_VLAN_VID(_W), a mask of ffff is equivalent to | |
452 | OXM_OF_VLAN_VID. -- means that OXM_OF_VLAN_PCP is omitted. | |
453 | <none> means that the given match is not supported. | |
df778240 BP |
454 | |
455 | The matches are: | |
456 | ||
457 | [1] Matches any packet, that is, one without an 802.1Q header or with | |
458 | an 802.1Q header with any TCI value. | |
459 | ||
460 | [2] Matches only packets without an 802.1Q header. | |
461 | ||
462 | NXM: Any match with (vlan_tci == 0) and (vlan_tci_mask & 0x1000) | |
463 | != 0 is equivalent to the one listed in the table. | |
464 | ||
465 | OF1.0: The spec doesn't define behavior if dl_vlan is set to | |
466 | 0xffff and OFPFW_DL_VLAN_PCP is not set. | |
467 | ||
468 | OF1.1: The spec says explicitly to ignore dl_vlan_pcp when | |
469 | dl_vlan is set to 0xffff. | |
470 | ||
471 | OF1.2: The spec doesn't say what should happen if (vlan_vid == 0) | |
472 | and (vlan_vid_mask & 0x1000) != 0 but (vlan_vid_mask != 0x1000), | |
473 | but it would be straightforward to also interpret as [2]. | |
474 | ||
475 | [3] Matches only packets that have an 802.1Q header with VID xxx (and | |
476 | any PCP). | |
477 | ||
478 | [4] Matches only packets that have an 802.1Q header with PCP y (and | |
479 | any VID). | |
480 | ||
481 | NXM: z is ((y << 1) | 1). | |
482 | ||
483 | OF1.0: The spec isn't very clear, but OVS implements it this way. | |
484 | ||
485 | OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff) | |
486 | == 0x1000 would also work, but the spec doesn't define their | |
487 | behavior. | |
488 | ||
489 | [5] Matches only packets that have an 802.1Q header with VID xxx and | |
490 | PCP y. | |
491 | ||
492 | NXM: z is ((y << 1) | 1). | |
493 | ||
494 | OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff) | |
495 | == 0x1fff would also work. | |
496 | ||
497 | [6] Matches packets with no 802.1Q header or with an 802.1Q header | |
498 | with a VID of 0. Only possible with NXM. | |
499 | ||
500 | [7] Matches packets with no 802.1Q header or with an 802.1Q header | |
501 | with a PCP of 0. Only possible with NXM. | |
502 | ||
503 | [8] Matches packets with no 802.1Q header or with an 802.1Q header | |
504 | with both VID and PCP of 0. Only possible with NXM. | |
505 | ||
506 | [9] Matches only packets that have an 802.1Q header with an | |
507 | odd-numbered VID (and any PCP). Only possible with NXM and | |
508 | OF1.2. (This is just an example; one can match on any desired | |
509 | VID bit pattern.) | |
510 | ||
511 | [10] Matches only packets that have an 802.1Q header with an | |
512 | odd-numbered PCP (and any VID). Only possible with NXM. (This | |
513 | is just an example; one can match on any desired PCP bit | |
514 | pattern.) | |
515 | ||
5fec03b1 BP |
516 | [11] Matches any packet with an 802.1Q header, regardless of VID or |
517 | PCP. | |
518 | ||
df778240 BP |
519 | Additional notes: |
520 | ||
542cc9bb TG |
521 | - OF1.2: The top three bits of OXM_OF_VLAN_VID are fixed to zero, |
522 | so bits 13, 14, and 15 in the masks listed in the table may be | |
523 | set to arbitrary values, as long as the corresponding value bits | |
524 | are also zero. The suggested ffff mask for [2], [3], and [5] | |
525 | allows a shorter OXM representation (the mask is omitted) than | |
526 | the minimal 1fff mask. | |
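As an illustration of the NXM column, the following sketch builds the value/mask pairs for matches [2], [3], and [4] and tests a TCI against them. It assumes the convention implied by the table: a packet without an 802.1Q header has a vlan_tci of 0, and a packet with one has bit 0x1000 set. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A value/mask pair as carried by NXM_OF_VLAN_TCI_W. */
struct tci_match {
    uint16_t value;
    uint16_t mask;
};

/* Returns true if 'tci' satisfies 'm'. */
bool
tci_match_accepts(struct tci_match m, uint16_t tci)
{
    return (tci & m.mask) == (m.value & m.mask);
}

/* [2] Only packets without an 802.1Q header: 0000/ffff. */
struct tci_match
match_no_vlan(void)
{
    struct tci_match m = { 0x0000, 0xffff };
    return m;
}

/* [3] Only packets with VID 'vid' and any PCP: 1xxx/1fff. */
struct tci_match
match_vid(uint16_t vid)
{
    struct tci_match m;

    assert(vid <= 0x0fff);
    m.value = (uint16_t) (0x1000 | vid);
    m.mask = 0x1fff;
    return m;
}

/* [4] Only packets with PCP 'pcp' and any VID: z000/f000, where
 * z is ((pcp << 1) | 1). */
struct tci_match
match_pcp(unsigned int pcp)
{
    struct tci_match m;
    unsigned int z;

    assert(pcp <= 7);
    z = (pcp << 1) | 1;
    m.value = (uint16_t) (z << 12);
    m.mask = 0xf000;
    return m;
}
```

For a tagged packet with PCP 3 and VID 5, the TCI under this convention is 0x7005, which satisfies both `match_vid(5)` and `match_pcp(3)` but not `match_no_vlan()`.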
df778240 BP |
527 | |
528 | ||
f66b87de BP |
529 | Flow Cookies |
530 | ============ | |
531 | ||
532 | OpenFlow 1.0 and later versions have the concept of a "flow cookie", | |
533 | which is a 64-bit integer value attached to each flow. The treatment | |
534 | of the flow cookie has varied greatly across OpenFlow versions, | |
535 | however. | |
536 | ||
537 | In OpenFlow 1.0: | |
538 | ||
542cc9bb | 539 | - OFPFC_ADD set the cookie in the flow that it added. |
f66b87de | 540 | |
542cc9bb TG |
541 | - OFPFC_MODIFY and OFPFC_MODIFY_STRICT updated the cookie for |
542 | the flow or flows that they modified. | |
f66b87de | 543 | |
542cc9bb | 544 | - OFPST_FLOW messages included the flow cookie. |
f66b87de | 545 | |
542cc9bb TG |
546 | - OFPT_FLOW_REMOVED messages reported the cookie of the flow |
547 | that was removed. | |
f66b87de BP |
548 | |
549 | OpenFlow 1.1 made the following changes: | |
550 | ||
542cc9bb TG |
551 | - Flow mod operations OFPFC_MODIFY, OFPFC_MODIFY_STRICT, |
552 | OFPFC_DELETE, and OFPFC_DELETE_STRICT, plus flow stats | |
553 | requests and aggregate stats requests, gained the ability to | |
554 | match on flow cookies with an arbitrary mask. | |
f66b87de | 555 | |
542cc9bb TG |
556 | - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to add a |
557 | new flow, in the case of no match, only if the flow table | |
558 | modification operation did not match on the cookie field. | |
559 | (In OpenFlow 1.0, modify operations always added a new flow | |
560 | when there was no match.) | |
f66b87de | 561 | |
542cc9bb TG |
562 | - OFPFC_MODIFY and OFPFC_MODIFY_STRICT no longer updated flow |
563 | cookies. | |
f66b87de BP |
564 | |
565 | OpenFlow 1.2 made the following changes: | |
566 | ||
542cc9bb TG |
567 | - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to never |
568 | add a new flow, regardless of whether the flow cookie was | |
569 | used for matching. | |
f66b87de BP |
570 | |
571 | Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0 | |
572 | behavior with the following extensions: | |
573 | ||
542cc9bb TG |
574 | - An NXM extension field NXM_NX_COOKIE(_W) allows the NXM |
575 | versions of OFPFC_MODIFY, OFPFC_MODIFY_STRICT, OFPFC_DELETE, | |
576 | and OFPFC_DELETE_STRICT flow_mods, plus flow stats requests | |
577 | and aggregate stats requests, to match on flow cookies with | |
578 | arbitrary masks. This is much like the equivalent OpenFlow | |
579 | 1.1 feature. | |
580 | ||
581 | - Like OpenFlow 1.1, OFPFC_MODIFY and OFPFC_MODIFY_STRICT add a | |
582 | new flow if there is no match and the mask is zero (or not | |
583 | given). | |
584 | ||
585 | - The "cookie" field in OFPT_FLOW_MOD and NXT_FLOW_MOD messages | |
586 | is used as the cookie value for OFPFC_ADD commands, as | |
587 | described in OpenFlow 1.0. For OFPFC_MODIFY and | |
588 | OFPFC_MODIFY_STRICT commands, the "cookie" field is used as a | |
589 | new cookie for flows that match unless it is UINT64_MAX, in | |
590 | which case the flow's cookie is not updated. | |
591 | ||
592 | - NXT_PACKET_IN (the Nicira extended version of | |
593 | OFPT_PACKET_IN) reports the cookie of the rule that | |
594 | generated the packet, or all-1-bits if no rule generated the | |
595 | packet. (Older versions of OVS used all-0-bits instead of | |
596 | all-1-bits.) | |
f66b87de | 597 | |
623e1caf JP |
598 | The following table shows the handling of different protocols when |
599 | receiving OFPFC_MODIFY and OFPFC_MODIFY_STRICT messages. A mask of 0 | |
600 | indicates either an explicit mask of zero or an implicit one by not | |
601 | specifying the NXM_NX_COOKIE(_W) field. | |
602 | ||
542cc9bb | 603 | ``` |
623e1caf JP |
604 | Match Update Add on miss Add on miss |
605 | cookie cookie mask!=0 mask==0 | |
606 | ====== ====== =========== =========== | |
607 | OpenFlow 1.0 no yes <always add on miss> | |
608 | OpenFlow 1.1 yes no no yes | |
609 | OpenFlow 1.2 yes no no no | |
610 | NXM yes yes* no yes | |
611 | ||
612 | * Updates the flow's cookie unless the "cookie" field is UINT64_MAX. | |
542cc9bb | 613 | ``` |
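The NXM row of this table can be sketched as three small helpers. The names are hypothetical and the code only restates the rules given in this section; it is not taken from the Open vSwitch source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if 'flow_cookie' agrees with 'cookie' on every bit that is
 * set in 'mask'.  A mask of 0 matches every flow. */
bool
cookie_matches(uint64_t flow_cookie, uint64_t cookie, uint64_t mask)
{
    return ((flow_cookie ^ cookie) & mask) == 0;
}

/* "Update cookie" for NXM modify commands: the flow takes the new cookie
 * unless the flow_mod's "cookie" field is UINT64_MAX. */
void
nxm_modify_update_cookie(uint64_t *flow_cookie, uint64_t new_cookie)
{
    assert(flow_cookie != NULL);

    if (new_cookie != UINT64_MAX) {
        *flow_cookie = new_cookie;
    }
}

/* "Add on miss" for the OpenFlow 1.1 and NXM rows: a modify command that
 * matched nothing adds a flow only if it did not match on the cookie. */
bool
modify_adds_on_miss(uint64_t cookie_mask)
{
    return cookie_mask == 0;
}
```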
f66b87de | 614 | |
66abb12b BP |
615 | Multiple Table Support |
616 | ====================== | |
617 | ||
618 | OpenFlow 1.0 has only rudimentary support for multiple flow tables. | |
619 | Notably, OpenFlow 1.0 does not allow the controller to specify the | |
620 | flow table to which a flow is to be added. Open vSwitch adds an | |
621 | extension for this purpose, which is enabled on a per-OpenFlow | |
622 | connection basis using the NXT_FLOW_MOD_TABLE_ID message. When the | |
623 | extension is enabled, the upper 8 bits of the 'command' member in an | |
624 | OFPT_FLOW_MOD or NXT_FLOW_MOD message designate the table to which a | |
625 | flow is to be added. | |
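A minimal sketch of that encoding follows. It assumes that the 'command' member has already been converted to host byte order and that the lower 8 bits keep the ordinary OFPFC_* command, which the text above does not state explicitly; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Packs a table ID into the upper 8 bits of the 16-bit 'command' value.
 * 'command' is the OFPFC_* value and must fit in the lower 8 bits. */
uint16_t
encode_flow_mod_command(uint8_t table_id, unsigned int command)
{
    assert(command <= 0xff);
    return (uint16_t) ((table_id << 8) | command);
}

/* Extracts the table ID from the upper 8 bits. */
uint8_t
flow_mod_table_id(uint16_t command)
{
    return (uint8_t) (command >> 8);
}

/* Extracts the OFPFC_* command from the lower 8 bits (an assumption of
 * this sketch, see above). */
uint8_t
flow_mod_command(uint16_t command)
{
    return (uint8_t) (command & 0xff);
}
```

For example, OFPFC_DELETE (3) aimed at table 5 would be carried as 0x0503 under these assumptions.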
626 | ||
627 | The Open vSwitch software switch implementation offers 255 flow | |
628 | tables. On packet ingress, only the first flow table (table 0) is | |
629 | searched, and the contents of the remaining tables are not considered | |
630 | in any way. Tables other than table 0 only come into play when an | |
631 | NXAST_RESUBMIT_TABLE action specifies another table to search. | |
632 | ||
633 | Tables 128 and above are reserved for use by the switch itself. | |
634 | Controllers should use only tables 0 through 127. | |
635 | ||
636 | ||
82c22d34 BP |
637 | OFPTC_* Table Configuration |
638 | =========================== | |
639 | ||
640 | This section covers the history of the OFPTC_* table configuration | |
641 | bits across OpenFlow versions. | |
642 | ||
643 | OpenFlow 1.0 flow tables had fixed configurations. | |
644 | ||
645 | OpenFlow 1.1 enabled controllers to configure behavior upon flow table | |
646 | miss and added the OFPTC_MISS_* constants for that purpose. OFPTC_* | |
647 | did not control anything else but it was nevertheless conceptualized | |
648 | as a set of bit-fields instead of an enum. OF1.1 added the | |
649 | OFPT_TABLE_MOD message to set OFPTC_MISS_* for a flow table and added | |
650 | the 'config' field to the OFPST_TABLE reply to report the current | |
651 | setting. | |
652 | ||
653 | OpenFlow 1.2 did not change anything in this regard. | |
654 | ||
655 | OpenFlow 1.3 switched to another means of changing flow table miss | |
656 | behavior and deprecated OFPTC_MISS_* without adding any more OFPTC_* | |
657 | constants. This meant that OFPT_TABLE_MOD now had no purpose at all, | |
658 | but OF1.3 kept it around "for backward compatibility with older and | |
659 | newer versions of the specification." At the same time, OF1.3 | |
660 | introduced a new message OFPMP_TABLE_FEATURES that included a field | |
661 | 'config' documented as reporting the OFPTC_* values set with | |
662 | OFPT_TABLE_MOD; of course this served no real purpose because no | |
663 | OFPTC_* values are defined. OF1.3 did remove the OFPTC_* field from | |
664 | OFPMP_TABLE (previously named OFPST_TABLE). | |
665 | ||
666 | OpenFlow 1.4 defined two new OFPTC_* constants, OFPTC_EVICTION and | |
667 | OFPTC_VACANCY_EVENTS, using bits that did not overlap with | |
668 | OFPTC_MISS_* even though those bits had not been defined since OF1.2. | |
669 | OFPT_TABLE_MOD still controlled these settings. The field for OFPTC_* | |
670 | values in OFPMP_TABLE_FEATURES was renamed from 'config' to | |
671 | 'capabilities' and documented as reporting the flags that are | |
672 | supported in an OFPT_TABLE_MOD message. The OFPMP_TABLE_DESC message | |
673 | newly added in OF1.4 reported the OFPTC_* setting. | |
674 | ||
675 | OpenFlow 1.5 did not change anything in this regard. | |
676 | ||
677 | The following table summarizes. The columns say: | |
678 | ||
679 | - OpenFlow version(s). | |
680 | ||
681 | - The OFPTC_* flags defined in those versions. | |
682 | ||
683 | - Whether OFPT_TABLE_MOD can modify OFPTC_* flags. | |
684 | ||
685 | - Whether OFPST_TABLE/OFPMP_TABLE reports the OFPTC_* flags. | |
686 | ||
687 | - What OFPMP_TABLE_FEATURES reports (if it exists): either the | |
688 | current configuration or the switch's capabilities. | |
689 | ||
690 | - Whether OFPMP_TABLE_DESC reports the current configuration. | |
691 | ||
692 |     OpenFlow   OFPTC_* flags            TABLE_MOD  stats?  TABLE_FEATURES  TABLE_DESC | |
693 |     ---------  -----------------------  ---------  ------  --------------  ---------- | |
694 |     OF1.0      none                     no[*][+]   no[*]   nothing[*][+]   no[*][+] | |
695 |     OF1.1/1.2  MISS_*                   yes        yes     nothing[+]      no[+] | |
696 |     OF1.3      none                     yes[*]     no[*]   config[*]       no[*][+] | |
697 |     OF1.4/1.5  EVICTION/VACANCY_EVENTS  yes        no      capabilities    yes | |
698 | ||
699 | [*] Nothing to report/change anyway. | |
700 | ||
701 | [+] No such message. | |
702 | ||
703 | ||
d31f1109 JP |
704 | IPv6 |
705 | ==== | |
706 | ||
707 | Open vSwitch supports stateless handling of IPv6 packets. Flows can be | |
708 | written to support matching TCP, UDP, and ICMPv6 headers within an IPv6 | |
685a51a5 JP |
709 | packet. Deeper matching of some Neighbor Discovery messages is also |
710 | supported. | |
d31f1109 JP |
711 | |
712 | IPv6 was not designed to interact well with middle-boxes. This, | |
713 | combined with Open vSwitch's stateless nature, has affected the | |
714 | processing of IPv6 traffic, which is detailed below. | |
715 | ||
716 | Extension Headers | |
717 | ----------------- | |
718 | ||
719 | The base IPv6 header is incredibly simple with the intention of only | |
720 | containing information relevant for routing packets between two | |
721 | endpoints. IPv6 relies heavily on the use of extension headers to | |
722 | provide any other functionality. Unfortunately, the extension headers | |
723 | were designed in such a way that it is impossible to move to the next | |
724 | header (including the layer-4 payload) unless the current header is | |
725 | understood. | |
726 | ||
727 | Open vSwitch will process the following extension headers and continue | |
728 | to the next header: | |
729 | ||
542cc9bb TG |
730 | * Fragment (see the next section) |
731 | * AH (Authentication Header) | |
732 | * Hop-by-Hop Options | |
733 | * Routing | |
734 | * Destination Options | |
d31f1109 JP |
735 | |
736 | When a header is encountered that is not in that list, it is considered | |
737 | "terminal". A terminal header's IPv6 protocol value is stored in | |
738 | "nw_proto" for matching purposes. If a terminal header is TCP, UDP, or | |
739 | ICMPv6, the packet will be further processed in an attempt to extract | |
740 | layer-4 information. | |
741 | ||
742 | Fragments | |
743 | --------- | |
744 | ||
745 | IPv6 requires that every link in the internet have an MTU of 1280 octets | |
746 | or greater (RFC 2460). As such, a terminal header (as described above in | |
747 | "Extension Headers") in the first fragment should generally be | |
748 | reachable. In this case, the terminal header's IPv6 protocol type is | |
749 | stored in the "nw_proto" field for matching purposes. If a terminal | |
750 | header cannot be found in the first fragment (one with a fragment offset | |
751 | of zero), the "nw_proto" field is set to 0. Subsequent fragments (those | |
752 | with a non-zero fragment offset) have the "nw_proto" field set to the | |
753 | IPv6 protocol type for fragments (44). | |
754 | ||
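The rules in "Extension Headers" and "Fragments" can be sketched as a walk over the header chain. This is a simplified illustration, not the Open vSwitch parser: the function name `terminal_nw_proto` is hypothetical, the input is assumed to be the Next Header value from the base IPv6 header plus the bytes that follow that header, and returning 0 for a truncated packet that is not a fragment is an assumption made only to keep the sketch total.

```python
import struct

SKIPPABLE = {0, 43, 60}  # Hop-by-Hop Options, Routing, Destination Options
AH = 51                  # Authentication Header
FRAGMENT = 44            # Fragment header


def terminal_nw_proto(next_header, payload):
    """Return the value this section says ends up in "nw_proto".

    next_header: Next Header value from the base IPv6 header.
    payload: bytes following the 40-byte base IPv6 header.
    """
    offset = 0
    while True:
        if next_header in SKIPPABLE:
            if len(payload) < offset + 2:
                return 0  # cannot read the header; assumption, see lead-in
            length = (payload[offset + 1] + 1) * 8  # units of 8 octets
        elif next_header == AH:
            if len(payload) < offset + 2:
                return 0
            length = (payload[offset + 1] + 2) * 4  # units of 4 octets
        elif next_header == FRAGMENT:
            if len(payload) < offset + 8:
                return 0
            (frag_field,) = struct.unpack_from("!H", payload, offset + 2)
            if frag_field >> 3:  # non-zero fragment offset: later fragment
                return FRAGMENT
            length = 8  # the Fragment header has a fixed size
        else:
            return next_header  # terminal header
        if len(payload) < offset + length:
            return 0  # header runs past the end of the packet
        next_header = payload[offset]  # first octet is always Next Header
        offset += length
```

A Hop-by-Hop header followed by TCP yields 6, a later fragment yields 44, and a first fragment whose terminal header is missing yields 0.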
755 | Jumbograms | |
756 | ---------- | |
757 | ||
758 | An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer | |
759 | than 65,535 octets. Jumbograms are only relevant in subnets with a link | |
760 | MTU greater than 65,575 octets, and need not be supported on nodes | |
761 | that do not connect to links with such large MTUs. Currently, Open | |
762 | vSwitch doesn't process jumbograms. | |
763 | ||
764 | ||
946350dc BP |
765 | In-Band Control |
766 | =============== | |
767 | ||
56e9c3b9 BP |
768 | Motivation |
769 | ---------- | |
770 | ||
771 | An OpenFlow switch must establish and maintain a TCP network | |
772 | connection to its controller. There are two basic ways to categorize | |
773 | the network that this connection traverses: either it is completely | |
774 | separate from the one that the switch is otherwise controlling, or its | |
775 | path may overlap the network that the switch controls. We call the | |
776 | former case "out-of-band control", the latter case "in-band control". | |
777 | ||
778 | Out-of-band control has the following benefits: | |
779 | ||
542cc9bb TG |
780 | - Simplicity: Out-of-band control slightly simplifies the switch |
781 | implementation. | |
56e9c3b9 | 782 | |
542cc9bb TG |
783 | - Reliability: Excessive switch traffic volume cannot interfere |
784 | with control traffic. | |
56e9c3b9 | 785 | |
542cc9bb TG |
786 | - Integrity: Machines not on the control network cannot |
787 | impersonate a switch or a controller. | |
56e9c3b9 | 788 | |
542cc9bb TG |
789 | - Confidentiality: Machines not on the control network cannot |
790 | snoop on control traffic. | |
56e9c3b9 BP |
791 | |
792 | In-band control, on the other hand, has the following advantages: | |
793 | ||
542cc9bb TG |
794 | - No dedicated port: There is no need to dedicate a physical |
795 | switch port to control, which is important on switches that have | |
796 | few ports (e.g. wireless routers, low-end embedded platforms). | |
56e9c3b9 | 797 | |
542cc9bb TG |
798 | - No dedicated network: There is no need to build and maintain a |
799 | separate control network. This is important in many | |
800 | environments because it reduces proliferation of switches and | |
801 | wiring. | |
56e9c3b9 BP |
802 | |
803 | Open vSwitch supports both out-of-band and in-band control. This | |
804 | section describes the principles behind in-band control. See the | |
805 | description of the Controller table in ovs-vswitchd.conf.db(5) to | |
806 | configure OVS for in-band control. | |
807 | ||
808 | Principles | |
809 | ---------- | |
810 | ||
811 | The fundamental principle of in-band control is that an OpenFlow | |
812 | switch must recognize and switch control traffic without involving the | |
813 | OpenFlow controller. All the details of implementing in-band control | |
814 | are special cases of this principle. | |
815 | ||
816 | The rationale for this principle is simple. If the switch does not | |
817 | handle in-band control traffic itself, then it will be caught in a | |
818 | contradiction: it must contact the controller, but it cannot, because | |
819 | only the controller can set up the flows that are needed to contact | |
820 | the controller. | |
821 | ||
822 | The following points describe important special cases of this | |
823 | principle. | |
824 | ||
542cc9bb TG |
825 | - In-band control must be implemented regardless of whether the |
826 | switch is connected. | |
827 | ||
828 | It is tempting to implement the in-band control rules only when | |
829 | the switch is not connected to the controller, using the | |
830 | reasoning that the controller should have complete control once | |
831 | it has established a connection with the switch. | |
832 | ||
833 | This does not work in practice. Consider the case where the | |
834 | switch is connected to the controller. Occasionally it can | |
835 | happen that the controller forgets or otherwise needs to obtain | |
836 | the MAC address of the switch. To do so, the controller sends a | |
837 | broadcast ARP request. A switch that implements the in-band | |
838 | control rules only when it is disconnected will then send an | |
839 | OFPT_PACKET_IN message up to the controller. The controller will | |
840 | be unable to respond, because it does not know the MAC address of | |
841 | the switch. This is a deadlock situation that can only be | |
842 | resolved by the switch noticing that its connection to the | |
843 | controller has hung and reconnecting. | |
844 | ||
845 | - In-band control must override flows set up by the controller. | |
846 | ||
847 | It is reasonable to assume that flows set up by the OpenFlow | |
848 | controller should take precedence over in-band control, on the | |
849 | basis that the controller should be in charge of the switch. | |
850 | ||
851 | Again, this does not work in practice. Reasonable controller | |
852 | implementations may set up a "last resort" fallback rule that | |
853 | wildcards every field and, e.g., sends it up to the controller or | |
854 | discards it. If a controller does that, then it will isolate | |
855 | itself from the switch. | |
856 | ||
857 | - The switch must recognize all control traffic. | |
858 | ||
859 | The fundamental principle of in-band control states, in part, | |
860 | that a switch must recognize control traffic without involving | |
861 | the OpenFlow controller. More specifically, the switch must | |
862 | recognize *all* control traffic. "False negatives", that is, | |
863 | packets that constitute control traffic but that the switch does | |
864 | not recognize as control traffic, lead to control traffic storms. | |
865 | ||
866 | Consider an OpenFlow switch that only recognizes control packets | |
867 | sent to or from that switch. Now suppose that two switches of | |
868 | this type, named A and B, are connected to ports on an Ethernet | |
869 | hub (not a switch) and that an OpenFlow controller is connected | |
870 | to a third hub port. In this setup, control traffic sent by | |
871 | switch A will be seen by switch B, which will send it to the | |
872 | controller as part of an OFPT_PACKET_IN message. Switch A will | |
873 | then see the OFPT_PACKET_IN message's packet, re-encapsulate it | |
874 | in another OFPT_PACKET_IN, and send it to the controller. Switch | |
875 | B will then see that OFPT_PACKET_IN, and so on in an infinite | |
876 | loop. | |
877 | ||
878 | Incidentally, the consequences of "false positives", where | |
879 | packets that are not control traffic are nevertheless recognized | |
880 | as control traffic, are much less severe. The controller will | |
881 | not be able to control their behavior, but the network will | |
882 | remain in working order. False positives do constitute a | |
883 | security problem. | |
884 | ||
885 | - The switch should use echo-requests to detect disconnection. | |
886 | ||
887 | TCP will notice that a connection has hung, but this can take a | |
888 | considerable amount of time. For example, with default settings | |
889 | the Linux kernel TCP implementation will retransmit for between | |
890 | 13 and 30 minutes, depending on the connection's retransmission | |
891 | timeout, according to kernel documentation. This is far too long | |
892 | for a switch to be disconnected, so an OpenFlow switch should | |
893 | implement its own connection timeout. OpenFlow OFPT_ECHO_REQUEST | |
894 | messages are the best way to do this, since they test the | |
895 | OpenFlow connection itself. | |
56e9c3b9 BP |
896 | |
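The echo-request principle above can be sketched as a small timer state machine. This is an illustrative assumption about how such a timeout might be structured, not a description of the Open vSwitch implementation; the class name `EchoProber`, its methods, and the single `probe_interval` parameter are all hypothetical.

```python
class EchoProber:
    """Decide when to send an echo request and when to give up.

    Hypothetical sketch: after probe_interval seconds without any received
    message, ask the caller to send an OFPT_ECHO_REQUEST; if a further
    probe_interval passes with still nothing received, report a disconnect.
    """

    def __init__(self, probe_interval):
        self.probe_interval = probe_interval
        self.last_activity = 0.0
        self.echo_sent_at = None

    def on_receive(self, now):
        """Record that some message arrived on the OpenFlow connection."""
        self.last_activity = now
        self.echo_sent_at = None

    def tick(self, now):
        """Return "send_echo_request", "disconnect", or None."""
        if self.echo_sent_at is not None:
            if now - self.echo_sent_at >= self.probe_interval:
                return "disconnect"
            return None
        if now - self.last_activity >= self.probe_interval:
            self.echo_sent_at = now
            return "send_echo_request"
        return None
```

With a 5-second interval, an idle connection is probed after 5 seconds and declared dead after 10, far sooner than the TCP retransmission timeout mentioned above.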
897 | Implementation | |
898 | -------------- | |
899 | ||
900 | This section describes how Open vSwitch implements in-band control. | |
901 | Correctly implementing in-band control has proven difficult due to its | |
902 | many subtleties, and has thus gone through many iterations. Please | |
903 | read through and understand the reasoning behind the chosen rules | |
904 | before making modifications. | |
905 | ||
906 | Open vSwitch implements in-band control as "hidden" flows, that is, | |
907 | flows that are not visible through OpenFlow, and at a higher priority | |
908 | than wildcarded flows can be set up through OpenFlow. This is done so | |
909 | that the OpenFlow controller cannot interfere with them and possibly | |
910 | break connectivity with its switches. It is possible to see all | |
911 | flows, including in-band ones, with the ovs-appctl "bridge/dump-flows" | |
912 | command. | |
946350dc BP |
913 | |
914 | The Open vSwitch implementation of in-band control can hide traffic to | |
915 | arbitrary "remotes", where each remote is one TCP port on one IP address. | |
916 | Currently the remotes are automatically configured as the in-band OpenFlow | |
917 | controllers plus the OVSDB managers, if any. (The latter is a requirement | |
918 | because OVSDB managers are responsible for configuring OpenFlow controllers, | |
919 | so if the manager cannot be reached then OpenFlow cannot be reconfigured.) | |
920 | ||
921 | The following rules (with the OFPP_NORMAL action) are set up on any bridge | |
922 | that has any remotes: | |
923 | ||
924 | (a) DHCP requests sent from the local port. | |
925 | (b) ARP replies to the local port's MAC address. | |
926 | (c) ARP requests from the local port's MAC address. | |
927 | ||
928 | In-band also sets up the following rules for each unique next-hop MAC | |
929 | address for the remotes' IPs (the "next hop" is either the remote | |
930 | itself, if it is on a local subnet, or the gateway to reach the remote): | |
931 | ||
932 | (d) ARP replies to the next hop's MAC address. | |
933 | (e) ARP requests from the next hop's MAC address. | |
934 | ||
935 | In-band also sets up the following rules for each unique remote IP address: | |
936 | ||
937 | (f) ARP replies containing the remote's IP address as a target. | |
938 | (g) ARP requests containing the remote's IP address as a source. | |
939 | ||
940 | In-band also sets up the following rules for each unique remote (IP,port) | |
941 | pair: | |
942 | ||
943 | (h) TCP traffic to the remote's IP and port. | |
944 | (i) TCP traffic from the remote's IP and port. | |
945 | ||
946 | The goal of these rules is to be as narrow as possible to allow a | |
947 | switch to join a network and be able to communicate with the | |
948 | remotes. As mentioned earlier, these rules have higher priority | |
949 | than the controller's rules, so if they are too broad, they may | |
950 | prevent the controller from implementing its policy. As such, | |
951 | in-band actively monitors some aspects of flow and packet processing | |
952 | so that the rules can be made more precise. | |
953 | ||
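Rules (a) through (i) above can be enumerated mechanically from the set of remotes. The sketch below only lists which rules would exist and what each is keyed on; it does not build real flow matches, and the function name `in_band_rules` and its arguments are hypothetical.

```python
def in_band_rules(local_mac, next_hop_macs, remotes):
    """List (label, key) pairs for the in-band rules described above.

    local_mac: MAC address of the bridge's local port.
    next_hop_macs: next-hop MAC addresses for the remotes' IPs.
    remotes: iterable of (ip, tcp_port) pairs.
    Hypothetical sketch; each key is the address the rule depends on.
    """
    remotes = sorted(set(remotes))
    if not remotes:
        return []  # no remotes, so no in-band rules on this bridge

    rules = [
        ("a", "local port"),  # DHCP requests sent from the local port
        ("b", local_mac),     # ARP replies to the local port's MAC
        ("c", local_mac),     # ARP requests from the local port's MAC
    ]
    for mac in sorted(set(next_hop_macs)):
        rules.append(("d", mac))  # ARP replies to the next hop's MAC
        rules.append(("e", mac))  # ARP requests from the next hop's MAC
    for ip in sorted({ip for ip, _ in remotes}):
        rules.append(("f", ip))   # ARP replies with remote IP as target
        rules.append(("g", ip))   # ARP requests with remote IP as source
    for ip, port in remotes:
        rules.append(("h", (ip, port)))  # TCP to the remote's IP and port
        rules.append(("i", (ip, port)))  # TCP from the remote's IP and port
    return rules
```

Two remotes that share an IP address but use different TCP ports share rules (f) and (g) but each get their own (h) and (i).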
954 | In-band control monitors attempts to add flows into the datapath that | |
955 | could interfere with its duties. The datapath only allows exact | |
956 | match entries, so in-band control is able to be very precise about | |
957 | the flows it prevents. Flows that miss in the datapath are sent to | |
958 | userspace to be processed, so preventing these flows from being | |
959 | cached in the "fast path" does not affect correctness. The only type | |
960 | of flow that is currently prevented is one that would prevent DHCP | |
961 | replies from being seen by the local port. For example, a rule that | |
962 | forwarded all DHCP traffic to the controller would not be allowed, | |
963 | but one that forwarded to all ports (including the local port) would. | |
964 | ||
965 | As mentioned earlier, packets that miss in the datapath are sent to | |
966 | the userspace for processing. The userspace has its own flow table, | |
967 | the "classifier", so in-band checks whether any special processing | |
968 | is needed before the classifier is consulted. If a packet is a DHCP | |
969 | response to a request from the local port, the packet is forwarded to | |
970 | the local port, regardless of the flow table. Note that this requires | |
971 | L7 processing of DHCP replies to determine whether the 'chaddr' field | |
972 | matches the MAC address of the local port. | |
973 | ||
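The 'chaddr' check mentioned above can be sketched as follows. The field offsets come from the fixed BOOTP/DHCP message layout (RFC 2131); the function name `is_dhcp_reply_for` is hypothetical, and the input is assumed to be the UDP payload of the packet with the Ethernet, IP, and UDP headers already removed.

```python
BOOTREPLY = 2        # 'op' value for a server-to-client message
HTYPE_ETHERNET = 1   # 'htype' value for Ethernet
CHADDR_OFFSET = 28   # op, htype, hlen, hops, xid, secs, flags, 4 addresses


def is_dhcp_reply_for(local_mac, bootp):
    """Return True if 'bootp' is a DHCP reply addressed to local_mac.

    local_mac: 6-byte MAC address of the local port.
    bootp: UDP payload of the candidate DHCP packet.
    Hypothetical sketch; the real check may validate more fields.
    """
    if len(bootp) < CHADDR_OFFSET + 16:  # chaddr is a 16-byte field
        return False
    op, htype, hlen = bootp[0], bootp[1], bootp[2]
    if op != BOOTREPLY or htype != HTYPE_ETHERNET or hlen != 6:
        return False
    return bootp[CHADDR_OFFSET:CHADDR_OFFSET + 6] == local_mac
```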
974 | It is interesting to note that for an L3-based in-band control | |
975 | mechanism, the majority of rules are devoted to ARP traffic. At first | |
976 | glance, some of these rules appear redundant. However, each serves an | |
977 | important role. First, in order to determine the MAC address of the | |
978 | remote side (controller or gateway) for other ARP rules, we must allow | |
979 | ARP traffic for our local port with rules (b) and (c). If we are | |
980 | between a switch and its connection to the remote, we have to | |
981 | allow the other switch's ARP traffic through. This is done with | |
982 | rules (d) and (e), since we do not know the addresses of the other | |
983 | switches a priori, but do know the remote's or gateway's. Finally, | |
984 | if the remote is running in a local guest VM that is not reached | |
985 | through the local port, the switch that is connected to the VM must | |
986 | allow ARP traffic based on the remote's IP address, since it will | |
987 | not know the MAC address of the local port that is sending the traffic | |
988 | or the MAC address of the remote in the guest VM. | |
989 | ||
990 | With a few notable exceptions below, in-band should work in most | |
1c38055d | 991 | network setups. The following are considered "supported" in the |
946350dc BP |
992 | current implementation: |
993 | ||
542cc9bb TG |
994 | - Locally Connected. The switch and remote are on the same |
995 | subnet. This uses rules (a), (b), (c), (h), and (i). | |
996 | ||
997 | - Reached through Gateway. The switch and remote are on | |
998 | different subnets and must go through a gateway. This uses | |
999 | rules (a), (b), (c), (h), and (i). | |
1000 | ||
1001 | - Between Switch and Remote. This switch is between another | |
1002 | switch and the remote, and we want to allow the other | |
1003 | switch's traffic through. This uses rules (d), (e), (h), and | |
1004 | (i). It uses (b) and (c) indirectly in order to know the MAC | |
1005 | address for rules (d) and (e). Note that DHCP for the other | |
1006 | switch will not work unless an OpenFlow controller explicitly lets this | |
1007 | switch pass the traffic. | |
1008 | ||
1009 | - Between Switch and Gateway. This switch is between another | |
1010 | switch and the gateway, and we want to allow the other switch's | |
1011 | traffic through. This uses the same rules and logic as the | |
1012 | "Between Switch and Remote" configuration described earlier. | |
1013 | ||
1014 | - Remote on Local VM. The remote is a guest VM on the | |
1015 | system running in-band control. This uses rules (a), (b), (c), | |
1016 | (h), and (i). | |
1017 | ||
1018 | - Remote on Local VM with Different Networks. The remote | |
1019 | is a guest VM on the system running in-band control, but the | |
1020 | local port is not used to connect to the remote. For | |
1021 | example, an IP address is configured on eth0 of the switch. The | |
1022 | remote's VM is connected through eth1 of the switch, but an | |
1023 | IP address has not been configured for that port on the switch. | |
1024 | As such, the switch will use eth0 to connect to the remote, | |
1025 | and eth1's rules about the local port will not work. In the | |
1026 | example, the switch attached to eth0 would use rules (a), (b), | |
1027 | (c), (h), and (i) on eth0. The switch attached to eth1 would use | |
1028 | rules (f), (g), (h), and (i). | |
946350dc BP |
1029 | |
1030 | The following are explicitly *not* supported by in-band control: | |
1031 | ||
542cc9bb TG |
1032 | - Specify Remote by Name. Currently, the remote must be |
1033 | identified by IP address. A naive approach would be to permit | |
1034 | all DNS traffic. Unfortunately, this would prevent the | |
1035 | controller from defining any policy over DNS. Since switches | |
1036 | that are located behind us need to connect to the remote, | |
1037 | in-band cannot simply add a rule that allows DNS traffic from | |
1038 | the local port. The "correct" way to support this is to parse | |
1039 | DNS requests to allow all traffic related to a request for the | |
1040 | remote's name through. Due to the potential security | |
1041 | problems and amount of processing, we decided to hold off for | |
1042 | the time being. | |
1043 | ||
1044 | - Differing Remotes for Switches. All switches must know | |
1045 | the L3 addresses for all the remotes that other switches | |
1046 | may use, since rules need to be set up to allow traffic related | |
1047 | to those remotes through. See rules (f), (g), (h), and (i). | |
1048 | ||
1049 | - Differing Routes for Switches. In order for the switch to | |
1050 | allow other switches to connect to a remote through a | |
1051 | gateway, it allows the gateway's traffic through with rules (d) | |
1052 | and (e). If the routes to the remote differ for the two | |
1053 | switches, we will not know the MAC address of the alternate | |
1054 | gateway. | |
946350dc BP |
1055 | |
1056 | ||
f25d0cf3 BP |
1057 | Action Reproduction |
1058 | =================== | |
1059 | ||
1060 | It seems likely that many controllers, at least at startup, use the | |
1061 | OpenFlow "flow statistics" request to obtain existing flows, then | |
1062 | compare the flows' actions against the actions that they expect to | |
1063 | find. Before version 1.8.0, Open vSwitch always returned exact, | |
1064 | byte-for-byte copies of the actions that had been added to the flow | |
1065 | table. The current version of Open vSwitch does not do this in | |
1066 | some exceptional cases. This section lists the exceptions that | |
1067 | controller authors must keep in mind if they compare actual actions | |
1068 | against desired actions in a bytewise fashion: | |
1069 | ||
542cc9bb TG |
1070 | - Open vSwitch zeros padding bytes in action structures, |
1071 | regardless of their values when the flows were added. | |
f25d0cf3 | 1072 | |
542cc9bb TG |
1073 | - Open vSwitch "normalizes" the instructions in OpenFlow 1.1 |
1074 | (and later) in the following way: | |
d01c980f | 1075 | |
542cc9bb TG |
1076 | * OVS sorts the instructions into the following order: |
1077 | Apply-Actions, Clear-Actions, Write-Actions, | |
1078 | Write-Metadata, Goto-Table. | |
d01c980f | 1079 | |
542cc9bb TG |
1080 | * OVS drops Apply-Actions instructions that have empty |
1081 | action lists. | |
d01c980f | 1082 | |
542cc9bb TG |
1083 | * OVS drops Write-Actions instructions that have empty |
1084 | action sets. | |
d01c980f | 1085 | |
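A controller that compares instructions can apply the same normalization to its desired flows before comparing. The sketch below models each instruction as a `(type, payload)` tuple, which is an assumption made for illustration; the type names and the function name `normalize` are hypothetical, and only the ordering and the two dropping rules come from the list above.

```python
# Order into which instructions are sorted, per the list above.
ORDER = [
    "apply_actions",
    "clear_actions",
    "write_actions",
    "write_metadata",
    "goto_table",
]

# Instruction types that are dropped when their action list or set is empty.
DROPPED_WHEN_EMPTY = ("apply_actions", "write_actions")


def normalize(instructions):
    """Return instructions sorted and pruned as described above.

    instructions: list of (type, payload) tuples, where the payload of
    apply_actions and write_actions is a list of actions.
    """
    kept = [
        (kind, payload)
        for kind, payload in instructions
        if not (kind in DROPPED_WHEN_EMPTY and not payload)
    ]
    return sorted(kept, key=lambda ins: ORDER.index(ins[0]))
```

For example, a flow added with Goto-Table, an empty Write-Actions, and a non-empty Apply-Actions, in that order, normalizes to Apply-Actions followed by Goto-Table.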
f25d0cf3 BP |
1086 | Please report other discrepancies, if you notice any, so that we can |
1087 | fix or document them. | |
1088 | ||
1089 | ||
d31f1109 JP |
1090 | Suggestions |
1091 | =========== | |
1092 | ||
1093 | Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org. |