Commit | Line | Data |
---|---|---|
d31f1109 JP |
1 | Design Decisions In Open vSwitch |
2 | ================================ | |
3 | ||
4 | This document describes design decisions that went into implementing | |
5 | Open vSwitch. While we believe these to be reasonable decisions, it is | |
6 | impossible to predict how Open vSwitch will be used in all environments. | |
7 | Understanding assumptions made by Open vSwitch is critical to a | |
8 | successful deployment. The end of this document contains contact | |
9 | information that can be used to let us know how we can make Open vSwitch | |
10 | more generally useful. | |
11 | ||
80d5aefd BP |
12 | Asynchronous Messages |
13 | ===================== | |
14 | ||
15 | Over time, Open vSwitch has added many knobs that control whether a | |
16 | given controller receives OpenFlow asynchronous messages. This | |
17 | section describes how all of these features interact. | |
18 | ||
19 | First, a service controller never receives any asynchronous messages | |
4550b647 MM |
20 | unless it changes its miss_send_len from the service controller |
21 | default of zero in one of the following ways: | |
22 | ||
23 | - Sending an OFPT_SET_CONFIG message with nonzero miss_send_len. | |
24 | ||
25 | - Sending any NXT_SET_ASYNC_CONFIG message: as a side effect, this | |
26 | message changes the miss_send_len to | |
27 | OFP_DEFAULT_MISS_SEND_LEN (128) for service controllers. | |
80d5aefd BP |
28 | |
29 | Second, OFPT_FLOW_REMOVED and NXT_FLOW_REMOVED messages are generated | |
30 | only if the flow that was removed had the OFPFF_SEND_FLOW_REM flag | |
31 | set. | |
32 | ||
a7349929 BP |
33 | Third, OFPT_PACKET_IN and NXT_PACKET_IN messages are sent only to |
34 | OpenFlow controller connections that have the correct connection ID | |
35 | (see "struct nx_controller_id" and "struct nx_action_controller"): | |
36 | ||
37 | - For packet-in messages generated by a NXAST_CONTROLLER action, | |
38 | the controller ID specified in the action. | |
39 | ||
40 | - For other packet-in messages, controller ID zero. (This is the | |
41 | default ID when an OpenFlow controller does not configure one.) | |
42 | ||
80d5aefd BP |
43 | Finally, Open vSwitch consults a per-connection table indexed by the |
44 | message type, reason code, and current role. The following table | |
45 | shows how this table is initialized by default when an OpenFlow | |
46 | connection is made. An entry labeled "yes" means that the message is | |
47 | sent; an entry labeled "---" means that the message is suppressed. | |
48 | ||
49 | master/ | |
50 | message and reason code other slave | |
51 | ---------------------------------------- ------- ----- | |
52 | OFPT_PACKET_IN / NXT_PACKET_IN | |
53 | OFPR_NO_MATCH yes --- | |
54 | OFPR_ACTION yes --- | |
55 | OFPR_INVALID_TTL --- --- | |
56 | ||
57 | OFPT_FLOW_REMOVED / NXT_FLOW_REMOVED | |
58 | OFPRR_IDLE_TIMEOUT yes --- | |
59 | OFPRR_HARD_TIMEOUT yes --- | |
60 | OFPRR_DELETE yes --- | |
61 | ||
62 | OFPT_PORT_STATUS | |
63 | OFPPR_ADD yes yes | |
64 | OFPPR_DELETE yes yes | |
65 | OFPPR_MODIFY yes yes | |
66 | ||
67 | The NXT_SET_ASYNC_CONFIG message directly sets all of the values in | |
68 | this table for the current connection. The | |
69 | OFPC_INVALID_TTL_TO_CONTROLLER bit in the OFPT_SET_CONFIG message | |
70 | controls the setting for OFPR_INVALID_TTL for the "master" role. | |
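The default table above can be sketched as a small lookup array. This is only an illustration: the enum names, the 0..2 reason indices (in the order the rows are listed above), and the function name are hypothetical and are not Open vSwitch identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch only. */
enum async_role { ROLE_MASTER_OR_OTHER, ROLE_SLAVE, N_ROLES };
enum async_type { ASYNC_PACKET_IN, ASYNC_FLOW_REMOVED, ASYNC_PORT_STATUS,
                  N_TYPES };
enum { N_REASONS = 3 };  /* Each message type above lists three reasons. */

/* Defaults copied from the table: [type][reason][role].
 * Reasons are indexed in the order the table lists them. */
static const bool default_async[N_TYPES][N_REASONS][N_ROLES] = {
    [ASYNC_PACKET_IN] = {
        { true,  false },   /* OFPR_NO_MATCH */
        { true,  false },   /* OFPR_ACTION */
        { false, false },   /* OFPR_INVALID_TTL */
    },
    [ASYNC_FLOW_REMOVED] = {
        { true,  false },   /* OFPRR_IDLE_TIMEOUT */
        { true,  false },   /* OFPRR_HARD_TIMEOUT */
        { true,  false },   /* OFPRR_DELETE */
    },
    [ASYNC_PORT_STATUS] = {
        { true,  true },    /* OFPPR_ADD */
        { true,  true },    /* OFPPR_DELETE */
        { true,  true },    /* OFPPR_MODIFY */
    },
};

/* Returns true if a new connection would receive the message by default. */
static bool
async_enabled_by_default(enum async_type type, int reason,
                         enum async_role role)
{
    assert(reason >= 0 && reason < N_REASONS);
    return default_async[type][reason][role];
}
```

In this sketch, NXT_SET_ASYNC_CONFIG would correspond to overwriting a per-connection copy of the whole array, and OFPC_INVALID_TTL_TO_CONTROLLER to flipping the single OFPR_INVALID_TTL entry for the master/other column.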
71 | ||
72 | ||
73 | OFPAT_ENQUEUE | |
74 | ============= | |
82172632 EJ |
75 | |
76 | The OpenFlow 1.0 specification requires the output port of the OFPAT_ENQUEUE | |
77 | action to "refer to a valid physical port (i.e. < OFPP_MAX) or OFPP_IN_PORT". | |
78 | Although OFPP_LOCAL is not less than OFPP_MAX, it is an 'internal' port which | |
79 | can have QoS applied to it in Linux. Since we allow the OFPAT_ENQUEUE to apply | |
80 | to 'internal' ports whose port numbers are less than OFPP_MAX, we interpret | |
81 | OFPP_LOCAL as a physical port and support OFPAT_ENQUEUE on it as well. | |
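The resulting acceptance rule can be written as a one-line predicate. This is a minimal sketch, not the Open vSwitch code: the function name is hypothetical, and the constants are the reserved port numbers from the OpenFlow 1.0 openflow.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reserved port numbers as defined in OpenFlow 1.0 openflow.h. */
#define OFPP_MAX     0xff00
#define OFPP_IN_PORT 0xfff8
#define OFPP_LOCAL   0xfffe

/* Hypothetical helper: true if 'port' is acceptable as the output port of
 * OFPAT_ENQUEUE under the interpretation described above, that is, the
 * spec's rule plus OFPP_LOCAL. */
static bool
enqueue_port_is_acceptable(uint16_t port)
{
    return port < OFPP_MAX || port == OFPP_IN_PORT || port == OFPP_LOCAL;
}
```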
82 | ||
d31f1109 | 83 | |
12442ec5 BP |
84 | OFPT_FLOW_MOD |
85 | ============= | |
86 | ||
3432cb4e BP |
87 | The OpenFlow specification for the behavior of OFPT_FLOW_MOD is |
88 | confusing. The following tables summarize the Open vSwitch | |
12442ec5 BP |
89 | implementation of its behavior in the following categories: |
90 | ||
91 | - "match on priority": Whether the flow_mod acts only on flows | |
92 | whose priority matches that included in the flow_mod message. | |
93 | ||
94 | - "match on out_port": Whether the flow_mod acts only on flows | |
95 | that output to the out_port included in the flow_mod message (if | |
3432cb4e BP |
96 | out_port is not OFPP_NONE). OpenFlow 1.1 and later have a |
97 | similar feature (not listed separately here) for out_group. | |
98 | ||
99 | - "match on flow_cookie": Whether the flow_mod acts only on flows | |
100 | whose flow_cookie matches an optional controller-specified value | |
101 | and mask. | |
12442ec5 BP |
102 | |
103 | - "updates flow_cookie": Whether the flow_mod changes the | |
104 | flow_cookie of the flow or flows that it matches to the | |
105 | flow_cookie included in the flow_mod message. | |
106 | ||
107 | - "updates OFPFF_SEND_FLOW_REM": Whether the flow_mod changes the | |
108 | OFPFF_SEND_FLOW_REM flag of the flow or flows that it matches to | |
109 | the setting included in the flags of the flow_mod message. | |
110 | ||
111 | - "honors OFPFF_CHECK_OVERLAP": Whether the OFPFF_CHECK_OVERLAP | |
112 | flag in the flow_mod is significant. | |
113 | ||
114 | - "updates idle_timeout" and "updates hard_timeout": Whether the | |
115 | idle_timeout and hard_timeout in the flow_mod, respectively, | |
116 | have an effect on the flow or flows matched by the flow_mod. | |
117 | ||
118 | - "resets idle timer": Whether the flow_mod resets the per-flow | |
119 | timer that measures how long a flow has been idle. | |
120 | ||
121 | - "resets hard timer": Whether the flow_mod resets the per-flow | |
122 | timer that measures how long it has been since a flow was | |
123 | modified. | |
124 | ||
125 | - "zeros counters": Whether the flow_mod resets per-flow packet | |
126 | and byte counters to zero. | |
127 | ||
3432cb4e BP |
128 | - "may add a new flow": Whether the flow_mod may add a new flow to |
129 | the flow table. (Obviously this is always true for "add" | |
130 | commands but in some OpenFlow versions "modify" and | |
131 | "modify-strict" can also add new flows.) | |
132 | ||
12442ec5 BP |
133 | - "sends flow_removed message": Whether the flow_mod generates a |
134 | flow_removed message for the flow or flows that it affects. | |
135 | ||
136 | An entry labeled "yes" means that the flow mod type does have the | |
137 | indicated behavior, "---" means that it does not, an empty cell means | |
138 | that the property is not applicable, and other values are explained | |
139 | below the table. | |
140 | ||
3432cb4e BP |
141 | OpenFlow 1.0 |
142 | ------------ | |
143 | ||
12442ec5 BP |
144 |                                           MODIFY          DELETE |
145 |                              ADD  MODIFY  STRICT  DELETE  STRICT | |
146 |                              ===  ======  ======  ======  ====== | |
3432cb4e | 147 | match on priority            yes    ---     yes     ---     yes |
906087ee | 148 | match on out_port            ---    ---     ---     yes     yes |
3432cb4e BP |
149 | match on flow_cookie         ---    ---     ---     ---     --- |
150 | match on table_id            ---    ---     ---     ---     --- | |
151 | controller chooses table_id  ---    ---     --- | |
12442ec5 BP |
152 | updates flow_cookie          yes    yes     yes |
153 | updates OFPFF_SEND_FLOW_REM  yes     +       + | |
154 | honors OFPFF_CHECK_OVERLAP   yes     +       + | |
155 | updates idle_timeout         yes     +       + | |
156 | updates hard_timeout         yes     +       + | |
157 | resets idle timer            yes     +       + | |
158 | resets hard timer            yes    yes     yes | |
159 | zeros counters               yes     +       + | |
3432cb4e BP |
160 | may add a new flow           yes    yes     yes |
161 | sends flow_removed message   ---    ---     ---      %       % | |
162 | ||
163 | (+) "modify" and "modify-strict" only take these actions when they | |
164 | create a new flow, not when they update an existing flow. | |
165 | ||
166 | (%) "delete" and "delete-strict" generate a flow_removed message if | |
167 | the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set. | |
168 | (Each controller can separately control whether it wants to | |
169 | receive the generated messages.) | |
170 | ||
171 | OpenFlow 1.1 | |
172 | ------------ | |
173 | ||
174 | OpenFlow 1.1 makes these changes: | |
175 | ||
176 | - The controller now must specify the table_id of the flow match | |
177 | searched and into which a flow may be inserted. Behavior for a | |
178 | table_id of 255 is undefined. | |
179 | ||
180 | - A flow_mod, except an "add", can now match on the flow_cookie. | |
181 | ||
182 | - When a flow_mod matches on the flow_cookie, "modify" and | |
183 | "modify-strict" never insert a new flow. | |
184 | ||
185 |                                           MODIFY          DELETE | |
186 |                              ADD  MODIFY  STRICT  DELETE  STRICT | |
187 |                              ===  ======  ======  ======  ====== | |
188 | match on priority            yes    ---     yes     ---     yes | |
189 | match on out_port            ---    ---     ---     yes     yes | |
190 | match on flow_cookie         ---    yes     yes     yes     yes | |
191 | match on table_id            yes    yes     yes     yes     yes | |
192 | controller chooses table_id  yes    yes     yes | |
193 | updates flow_cookie          yes    ---     --- | |
194 | updates OFPFF_SEND_FLOW_REM  yes     +       + | |
195 | honors OFPFF_CHECK_OVERLAP   yes     +       + | |
196 | updates idle_timeout         yes     +       + | |
197 | updates hard_timeout         yes     +       + | |
198 | resets idle timer            yes     +       + | |
199 | resets hard timer            yes    yes     yes | |
200 | zeros counters               yes     +       + | |
201 | may add a new flow           yes     #       # | |
12442ec5 BP |
202 | sends flow_removed message   ---    ---     ---      %       % |
203 | ||
204 | (+) "modify" and "modify-strict" only take these actions when they | |
205 | create a new flow, not when they update an existing flow. | |
206 | ||
207 | (%) "delete" and "delete-strict" generate a flow_removed message if | |
208 | the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set. | |
209 | (Each controller can separately control whether it wants to | |
210 | receive the generated messages.) | |
211 | ||
3432cb4e BP |
212 | (#) "modify" and "modify-strict" only add a new flow if the flow_mod |
213 | does not match on any bits of the flow cookie. | |
214 | ||
215 | OpenFlow 1.2 | |
216 | ------------ | |
217 | ||
218 | OpenFlow 1.2 makes these changes: | |
219 | ||
220 | - Only "add" commands ever add flows; "modify" and "modify-strict" | |
221 | never do. | |
222 | ||
223 | - A new flag OFPFF_RESET_COUNTS now controls whether "modify" and | |
224 | "modify-strict" reset counters, whereas previously they never | |
225 | reset counters (except when they inserted a new flow). | |
226 | ||
227 |                                           MODIFY          DELETE | |
228 |                              ADD  MODIFY  STRICT  DELETE  STRICT | |
229 |                              ===  ======  ======  ======  ====== | |
230 | match on priority            yes    ---     yes     ---     yes | |
231 | match on out_port            ---    ---     ---     yes     yes | |
232 | match on flow_cookie         ---    yes     yes     yes     yes | |
233 | match on table_id            yes    yes     yes     yes     yes | |
234 | controller chooses table_id  yes    yes     yes | |
235 | updates flow_cookie          yes    ---     --- | |
236 | updates OFPFF_SEND_FLOW_REM  yes    ---     --- | |
237 | honors OFPFF_CHECK_OVERLAP   yes    ---     --- | |
238 | updates idle_timeout         yes    ---     --- | |
239 | updates hard_timeout         yes    ---     --- | |
240 | resets idle timer            yes    ---     --- | |
241 | resets hard timer            yes    yes     yes | |
242 | zeros counters               yes     &       & | |
243 | may add a new flow           yes    ---     --- | |
244 | sends flow_removed message   ---    ---     ---      %       % | |
245 | ||
246 | (%) "delete" and "delete-strict" generate a flow_removed message if | |
247 | the deleted flow or flows have the OFPFF_SEND_FLOW_REM flag set. | |
248 | (Each controller can separately control whether it wants to | |
249 | receive the generated messages.) | |
250 | ||
251 | (&) "modify" and "modify-strict" reset counters if the | |
252 | OFPFF_RESET_COUNTS flag is specified. | |
253 | ||
254 | OpenFlow 1.3 | |
255 | ------------ | |
256 | ||
257 | OpenFlow 1.3 makes these changes: | |
258 | ||
259 | - Behavior for a table_id of 255 is now defined, for "delete" and | |
260 | "delete-strict" commands, as meaning to delete from all tables. | |
261 | A table_id of 255 is now explicitly invalid for other commands. | |
262 | ||
263 | - New flags OFPFF_NO_PKT_COUNTS and OFPFF_NO_BYT_COUNTS for "add" | |
264 | operations. | |
265 | ||
266 | The table for 1.3 is the same as the one shown above for 1.2. | |
267 | ||
12442ec5 | 268 | |
4d197ebb BP |
269 | OFPT_PACKET_IN |
270 | ============== | |
271 | ||
272 | The OpenFlow 1.1 specification for OFPT_PACKET_IN is confusing. The | |
273 | definition in OF1.1 openflow.h is[*]: | |
274 | ||
275 | /* Packet received on port (datapath -> controller). */ | |
276 | struct ofp_packet_in { | |
277 | struct ofp_header header; | |
278 | uint32_t buffer_id; /* ID assigned by datapath. */ | |
279 | uint32_t in_port; /* Port on which frame was received. */ | |
280 | uint32_t in_phy_port; /* Physical Port on which frame was received. */ | |
281 | uint16_t total_len; /* Full length of frame. */ | |
282 | uint8_t reason; /* Reason packet is being sent (one of OFPR_*) */ | |
283 | uint8_t table_id; /* ID of the table that was looked up */ | |
284 | uint8_t data[0]; /* Ethernet frame, halfway through 32-bit word, | |
285 | so the IP header is 32-bit aligned. The | |
286 | amount of data is inferred from the length | |
287 | field in the header. Because of padding, | |
288 | offsetof(struct ofp_packet_in, data) == | |
289 | sizeof(struct ofp_packet_in) - 2. */ | |
290 | }; | |
291 | OFP_ASSERT(sizeof(struct ofp_packet_in) == 24); | |
292 | ||
293 | The confusing part is the comment on the data[] member. This comment | |
294 | is a leftover from OF1.0 openflow.h, in which the comment was correct: | |
295 | sizeof(struct ofp_packet_in) is 20 in OF1.0 and offsetof(struct | |
296 | ofp_packet_in, data) is 18. When OF1.1 was written, the structure | |
297 | members were changed but the comment was carelessly not updated, and | |
298 | the comment became wrong: sizeof(struct ofp_packet_in) and | |
299 | offsetof(struct ofp_packet_in, data) are both 24 in OF1.1. | |
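The sizes quoted above can be checked with a short sketch. The struct names below are hypothetical (chosen so both layouts can coexist in one file), the member lists follow the OF1.0 and OF1.1 openflow.h definitions, and the results assume a typical ABI in which uint32_t is 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 8-byte OpenFlow header shared by both versions. */
struct ofp_header {
    uint8_t version;
    uint8_t type;
    uint16_t length;
    uint32_t xid;
};

/* OF1.0 layout: the last fixed member ends at byte 18, and the compiler
 * pads the struct to 20, so data[] starts 2 bytes before sizeof. */
struct of10_packet_in_sketch {
    struct ofp_header header;
    uint32_t buffer_id;
    uint16_t total_len;
    uint16_t in_port;
    uint8_t reason;
    uint8_t pad;
    uint8_t data[];
};

/* OF1.1 layout: the fixed members already fill 24 bytes, so there is no
 * trailing padding and data[] starts exactly at sizeof. */
struct of11_packet_in_sketch {
    struct ofp_header header;
    uint32_t buffer_id;
    uint32_t in_port;
    uint32_t in_phy_port;
    uint16_t total_len;
    uint8_t reason;
    uint8_t table_id;
    uint8_t data[];
};

/* Bytes between the start of data[] and the end of the struct. */
static size_t
of10_trailing_padding(void)
{
    return sizeof(struct of10_packet_in_sketch)
           - offsetof(struct of10_packet_in_sketch, data);
}

static size_t
of11_trailing_padding(void)
{
    return sizeof(struct of11_packet_in_sketch)
           - offsetof(struct of11_packet_in_sketch, data);
}
```

Under those assumptions the first helper yields 2 and the second yields 0, which is the discrepancy the stale comment hides.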
300 | ||
301 | That leaves the question of how to implement ofp_packet_in in OF1.1. | |
302 | The OpenFlow reference implementation for OF1.1 does not include any | |
303 | padding, that is, the first byte of the encapsulated frame immediately | |
304 | follows the 'table_id' member without a gap. Open vSwitch therefore | |
305 | implements it the same way for compatibility. | |
306 | ||
307 | For an earlier discussion, please see the thread archived at: | |
308 | https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html | |
309 | ||
310 | [*] The quoted definition is directly from OF1.1. Definitions used | |
311 | inside OVS omit the 8-byte ofp_header members, so the sizes in | |
312 | this discussion are 8 bytes larger than those declared in OVS | |
313 | header files. | |
314 | ||
315 | ||
df778240 BP |
316 | VLAN Matching |
317 | ============= | |
318 | ||
319 | The 802.1Q VLAN header causes more trouble than any other 4 bytes in | |
320 | networking. More specifically, three versions of OpenFlow and Open | |
321 | vSwitch have among them four different ways to match the contents and | |
322 | presence of the VLAN header. The following table describes how each | |
323 | version works. | |
324 | ||
325 | Match NXM OF1.0 OF1.1 OF1.2 | |
326 | ----- --------- ----------- ----------- ------------ | |
327 | [1] 0000/0000 ????/1,??/? ????/1,??/? 0000/0000,-- | |
328 | [2] 0000/ffff ffff/0,??/? ffff/0,??/? 0000/ffff,-- | |
329 | [3] 1xxx/1fff 0xxx/0,??/1 0xxx/0,??/1 1xxx/ffff,-- | |
330 | [4] z000/f000 ????/1,0y/0 fffe/0,0y/0 1000/1000,0y | |
331 | [5] zxxx/ffff 0xxx/0,0y/0 0xxx/0,0y/0 1xxx/ffff,0y | |
332 | [6] 0000/0fff <none> <none> <none> | |
333 | [7] 0000/f000 <none> <none> <none> | |
334 | [8] 0000/efff <none> <none> <none> | |
335 | [9] 1001/1001 <none> <none> 1001/1001,-- | |
336 | [10] 3000/3000 <none> <none> <none> | |
337 | ||
338 | Each column is interpreted as follows. | |
339 | ||
340 | - Match: See the list below. | |
341 | ||
342 | - NXM: xxxx/yyyy means NXM_OF_VLAN_TCI_W with value xxxx and mask | |
343 | yyyy. A mask of 0000 is equivalent to omitting | |
344 | NXM_OF_VLAN_TCI(_W), a mask of ffff is equivalent to | |
345 | NXM_OF_VLAN_TCI. | |
346 | ||
347 | - OF1.0 and OF1.1: wwww/x,yy/z means dl_vlan wwww, OFPFW_DL_VLAN | |
348 | x, dl_vlan_pcp yy, and OFPFW_DL_VLAN_PCP z. ? means that the | |
701351f8 | 349 | given nibble is ignored (and conventionally 0 for wwww or yy, |
df778240 BP |
350 | conventionally 1 for x or z). <none> means that the given match |
351 | is not supported. | |
352 | ||
353 | - OF1.2: xxxx/yyyy,zz means OXM_OF_VLAN_VID_W with value xxxx and | |
354 | mask yyyy, and OXM_OF_VLAN_PCP (which is not maskable) with | |
355 | value zz. A mask of 0000 is equivalent to omitting | |
356 | OXM_OF_VLAN_VID(_W), a mask of ffff is equivalent to | |
357 | OXM_OF_VLAN_VID. -- means that OXM_OF_VLAN_PCP is omitted. | |
358 | <none> means that the given match is not supported. | |
359 | ||
360 | The matches are: | |
361 | ||
362 | [1] Matches any packet, that is, one without an 802.1Q header or with | |
363 | an 802.1Q header with any TCI value. | |
364 | ||
365 | [2] Matches only packets without an 802.1Q header. | |
366 | ||
367 | NXM: Any match with (vlan_tci == 0) and (vlan_tci_mask & 0x1000) | |
368 | != 0 is equivalent to the one listed in the table. | |
369 | ||
370 | OF1.0: The spec doesn't define behavior if dl_vlan is set to | |
371 | 0xffff and OFPFW_DL_VLAN_PCP is not set. | |
372 | ||
373 | OF1.1: The spec says explicitly to ignore dl_vlan_pcp when | |
374 | dl_vlan is set to 0xffff. | |
375 | ||
376 | OF1.2: The spec doesn't say what should happen if (vlan_vid == 0) | |
377 | and (vlan_vid_mask & 0x1000) != 0 but (vlan_vid_mask != 0x1000), | |
378 | but it would be straightforward to also interpret as [2]. | |
379 | ||
380 | [3] Matches only packets that have an 802.1Q header with VID xxx (and | |
381 | any PCP). | |
382 | ||
383 | [4] Matches only packets that have an 802.1Q header with PCP y (and | |
384 | any VID). | |
385 | ||
386 | NXM: z is ((y << 1) | 1). | |
387 | ||
388 | OF1.0: The spec isn't very clear, but OVS implements it this way. | |
389 | ||
390 | OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff) | |
391 | == 0x1000 would also work, but the spec doesn't define their | |
392 | behavior. | |
393 | ||
394 | [5] Matches only packets that have an 802.1Q header with VID xxx and | |
395 | PCP y. | |
396 | ||
397 | NXM: z is ((y << 1) | 1). | |
398 | ||
399 | OF1.2: Presumably other masks such that (vlan_vid_mask & 0x1fff) | |
400 | == 0x1fff would also work. | |
401 | ||
402 | [6] Matches packets with no 802.1Q header or with an 802.1Q header | |
403 | with a VID of 0. Only possible with NXM. | |
404 | ||
405 | [7] Matches packets with no 802.1Q header or with an 802.1Q header | |
406 | with a PCP of 0. Only possible with NXM. | |
407 | ||
408 | [8] Matches packets with no 802.1Q header or with an 802.1Q header | |
409 | with both VID and PCP of 0. Only possible with NXM. | |
410 | ||
411 | [9] Matches only packets that have an 802.1Q header with an | |
412 | odd-numbered VID (and any PCP). Only possible with NXM and | |
413 | OF1.2. (This is just an example; one can match on any desired | |
414 | VID bit pattern.) | |
415 | ||
416 | [10] Matches only packets that have an 802.1Q header with an | |
417 | odd-numbered PCP (and any VID). Only possible with NXM. (This | |
418 | is just an example; one can match on any desired PCP bit | |
419 | pattern.) | |
420 | ||
421 | Additional notes: | |
422 | ||
423 | - OF1.2: The top three bits of OXM_OF_VLAN_VID are fixed to zero, | |
424 | so bits 13, 14, and 15 in the masks listed in the table may be | |
425 | set to arbitrary values, as long as the corresponding value bits | |
426 | are also zero. The suggested ffff mask for [2], [3], and [5] | |
427 | allows a shorter OXM representation (the mask is omitted) than | |
428 | the minimal 1fff mask. | |
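The NXM column of the table can be illustrated with a few helpers. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes, as the table implies, that a packet without an 802.1Q header presents a TCI of 0 and a packet with one presents its TCI with bit 12 (0x1000) set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Top nibble "z" used by matches [4] and [5]: z = (y << 1) | 1, that is,
 * the 3-bit PCP followed by the always-set bit 12. */
static uint16_t
nxm_pcp_nibble(uint8_t pcp)
{
    assert(pcp <= 7);
    return (uint16_t) ((pcp << 1) | 1);
}

/* Value for match [5] (zxxx/ffff): 802.1Q header with this VID and PCP. */
static uint16_t
nxm_tci_vid_pcp(uint16_t vid, uint8_t pcp)
{
    assert(vid <= 0xfff);
    return (uint16_t) ((nxm_pcp_nibble(pcp) << 12) | vid);
}

/* Masked comparison of a packet's TCI against a value/mask pair. */
static bool
nxm_tci_matches(uint16_t pkt_tci, uint16_t value, uint16_t mask)
{
    return (pkt_tci & mask) == (value & mask);
}
```

For example, with match [6] (0000/0fff) both a TCI of 0x0000 (no header) and 0x3000 (header with VID 0) satisfy nxm_tci_matches(), while 0x1001 (header with VID 1) does not.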
429 | ||
430 | ||
f66b87de BP |
431 | Flow Cookies |
432 | ============ | |
433 | ||
434 | OpenFlow 1.0 and later versions have the concept of a "flow cookie", | |
435 | which is a 64-bit integer value attached to each flow. The treatment | |
436 | of the flow cookie has varied greatly across OpenFlow versions, | |
437 | however. | |
438 | ||
439 | In OpenFlow 1.0: | |
440 | ||
441 | - OFPFC_ADD set the cookie in the flow that it added. | |
442 | ||
443 | - OFPFC_MODIFY and OFPFC_MODIFY_STRICT updated the cookie for | |
444 | the flow or flows that it modified. | |
445 | ||
446 | - OFPST_FLOW messages included the flow cookie. | |
447 | ||
448 | - OFPT_FLOW_REMOVED messages reported the cookie of the flow | |
449 | that was removed. | |
450 | ||
451 | OpenFlow 1.1 made the following changes: | |
452 | ||
453 | - Flow mod operations OFPFC_MODIFY, OFPFC_MODIFY_STRICT, | |
454 | OFPFC_DELETE, and OFPFC_DELETE_STRICT, plus flow stats | |
455 | requests and aggregate stats requests, gained the ability to | |
456 | match on flow cookies with an arbitrary mask. | |
457 | ||
458 | - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to add a | |
459 | new flow, in the case of no match, only if the flow table | |
460 | modification operation did not match on the cookie field. | |
461 | (In OpenFlow 1.0, modify operations always added a new flow | |
462 | when there was no match.) | |
463 | ||
464 | - OFPFC_MODIFY and OFPFC_MODIFY_STRICT no longer updated flow | |
465 | cookies. | |
466 | ||
467 | OpenFlow 1.2 made the following changes: | |
468 | ||
469 | - OFPFC_MODIFY and OFPFC_MODIFY_STRICT were changed to never | |
470 | add a new flow, regardless of whether the flow cookie was | |
471 | used for matching. | |
472 | ||
473 | Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0 | |
474 | behavior with the following extensions: | |
475 | ||
476 | - An NXM extension field NXM_NX_COOKIE(_W) allows the NXM | |
477 | versions of OFPFC_MODIFY, OFPFC_MODIFY_STRICT, OFPFC_DELETE, | |
478 | and OFPFC_DELETE_STRICT flow_mods, plus flow stats requests | |
479 | and aggregate stats requests, to match on flow cookies with | |
480 | arbitrary masks. This is much like the equivalent OpenFlow | |
481 | 1.1 feature. | |
482 | ||
623e1caf JP |
483 | - Like OpenFlow 1.1, OFPFC_MODIFY and OFPFC_MODIFY_STRICT add a |
484 | new flow if there is no match and the mask is zero (or not | |
485 | given). | |
486 | ||
487 | - The "cookie" field in OFPT_FLOW_MOD and NXT_FLOW_MOD messages | |
488 | is used as the cookie value for OFPFC_ADD commands, as | |
489 | described in OpenFlow 1.0. For OFPFC_MODIFY and | |
490 | OFPFC_MODIFY_STRICT commands, the "cookie" field is used as a | |
491 | new cookie for flows that match unless it is UINT64_MAX, in | |
492 | which case the flow's cookie is not updated. | |
f66b87de BP |
493 | |
494 | - NXT_PACKET_IN (the Nicira extended version of | |
495 | OFPT_PACKET_IN) reports the cookie of the rule that | |
496 | generated the packet, or all-1-bits if no rule generated the | |
497 | packet. (Older versions of OVS used all-0-bits instead of | |
498 | all-1-bits.) | |
499 | ||
623e1caf JP |
500 | The following table shows the handling of different protocols when |
501 | receiving OFPFC_MODIFY and OFPFC_MODIFY_STRICT messages. A mask of 0 | |
502 | indicates either an explicit mask of zero or an implicit one by not | |
503 | specifying the NXM_NX_COOKIE(_W) field. | |
504 | ||
505 | Match Update Add on miss Add on miss | |
506 | cookie cookie mask!=0 mask==0 | |
507 | ====== ====== =========== =========== | |
508 | OpenFlow 1.0 no yes <always add on miss> | |
509 | OpenFlow 1.1 yes no no yes | |
510 | OpenFlow 1.2 yes no no no | |
511 | NXM yes yes* no yes | |
512 | ||
513 | * Updates the flow's cookie unless the "cookie" field is UINT64_MAX. | |
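The NXM row of the table can be sketched as three small predicates. The function names are hypothetical and this is not the Open vSwitch implementation; it only restates the rules above for OFPFC_MODIFY and OFPFC_MODIFY_STRICT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a flow's cookie satisfies a cookie/mask match.  A mask of 0
 * (explicit, or implied by omitting NXM_NX_COOKIE(_W)) matches any flow. */
static bool
cookie_matches(uint64_t flow_cookie, uint64_t cookie, uint64_t mask)
{
    return ((flow_cookie ^ cookie) & mask) == 0;
}

/* NXM row: a modify that matched no flow adds one only if mask == 0. */
static bool
nxm_modify_adds_on_miss(uint64_t mask)
{
    return mask == 0;
}

/* NXM row: cookie of a matched flow after the modify.  A "cookie" field
 * of UINT64_MAX in the flow_mod leaves the flow's cookie unchanged. */
static uint64_t
nxm_modify_new_cookie(uint64_t old_cookie, uint64_t fm_cookie)
{
    return fm_cookie == UINT64_MAX ? old_cookie : fm_cookie;
}
```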
514 | ||
f66b87de | 515 | |
66abb12b BP |
516 | Multiple Table Support |
517 | ====================== | |
518 | ||
519 | OpenFlow 1.0 has only rudimentary support for multiple flow tables. | |
520 | Notably, OpenFlow 1.0 does not allow the controller to specify the | |
521 | flow table to which a flow is to be added. Open vSwitch adds an | |
522 | extension for this purpose, which is enabled on a per-OpenFlow | |
523 | connection basis using the NXT_FLOW_MOD_TABLE_ID message. When the | |
524 | extension is enabled, the upper 8 bits of the 'command' member in an | |
525 | OFPT_FLOW_MOD or NXT_FLOW_MOD message designate the table to which a | |
526 | flow is to be added. | |
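A minimal sketch of how a receiver might split the 16-bit 'command' member when the extension is enabled follows. The struct and function names are hypothetical, the value is assumed to have been converted to host byte order already, and treating the low 8 bits as the OFPFC_* command is an assumption that the text above implies but does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct split_flow_mod_command {
    uint8_t table_id;   /* Upper 8 bits: table to which the flow is added. */
    uint8_t command;    /* Lower 8 bits: assumed to be the OFPFC_* command. */
};

/* 'command' must already be in host byte order. */
static struct split_flow_mod_command
split_flow_mod_command(uint16_t command)
{
    struct split_flow_mod_command out;

    out.table_id = (uint8_t) (command >> 8);
    out.command = (uint8_t) (command & 0xff);
    return out;
}
```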
527 | ||
528 | The Open vSwitch software switch implementation offers 255 flow | |
529 | tables. On packet ingress, only the first flow table (table 0) is | |
530 | searched, and the contents of the remaining tables are not considered | |
531 | in any way. Tables other than table 0 only come into play when an | |
532 | NXAST_RESUBMIT_TABLE action specifies another table to search. | |
533 | ||
534 | Tables 128 and above are reserved for use by the switch itself. | |
535 | Controllers should use only tables 0 through 127. | |
536 | ||
537 | ||
d31f1109 JP |
538 | IPv6 |
539 | ==== | |
540 | ||
541 | Open vSwitch supports stateless handling of IPv6 packets. Flows can be | |
542 | written to support matching TCP, UDP, and ICMPv6 headers within an IPv6 | |
685a51a5 JP |
543 | packet. Deeper matching of some Neighbor Discovery messages is also |
544 | supported. | |
d31f1109 JP |
545 | |
546 | IPv6 was not designed to interact well with middle-boxes. This, | |
547 | combined with Open vSwitch's stateless nature, has affected the | |
548 | processing of IPv6 traffic, which is detailed below. | |
549 | ||
550 | Extension Headers | |
551 | ----------------- | |
552 | ||
553 | The base IPv6 header is incredibly simple with the intention of only | |
554 | containing information relevant for routing packets between two | |
555 | endpoints. IPv6 relies heavily on the use of extension headers to | |
556 | provide any other functionality. Unfortunately, the extension headers | |
557 | were designed in such a way that it is impossible to move to the next | |
558 | header (including the layer-4 payload) unless the current header is | |
559 | understood. | |
560 | ||
561 | Open vSwitch will process the following extension headers and continue | |
562 | to the next header: | |
563 | ||
564 | * Fragment (see the next section) | |
565 | * AH (Authentication Header) | |
566 | * Hop-by-Hop Options | |
567 | * Routing | |
568 | * Destination Options | |
569 | ||
570 | When a header is encountered that is not in that list, it is considered | |
571 | "terminal". A terminal header's IPv6 protocol value is stored in | |
572 | "nw_proto" for matching purposes. If a terminal header is TCP, UDP, or | |
573 | ICMPv6, the packet will be further processed in an attempt to extract | |
574 | layer-4 information. | |
575 | ||
576 | Fragments | |
577 | --------- | |
578 | ||
579 | IPv6 requires that every link in the internet have an MTU of 1280 octets | |
580 | or greater (RFC 2460). As such, a terminal header (as described above in | |
581 | "Extension Headers") in the first fragment should generally be | |
582 | reachable. In this case, the terminal header's IPv6 protocol type is | |
583 | stored in the "nw_proto" field for matching purposes. If a terminal | |
584 | header cannot be found in the first fragment (one with a fragment offset | |
585 | of zero), the "nw_proto" field is set to 0. Subsequent fragments (those | |
586 | with a non-zero fragment offset) have the "nw_proto" field set to the | |
587 | IPv6 protocol type for fragments (44). | |
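The extension header and fragment rules above can be sketched as a single loop. This is an illustration, not the Open vSwitch parser: the names are hypothetical, the header length formulas are the ones from the IPv6 and AH specifications, and returning 0 when a non-fragmented packet is truncated is an assumption (the text above only specifies 0 for first fragments).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IPv6 Next Header values for the extension headers listed above. */
enum {
    EXT_HOPOPTS = 0,
    EXT_ROUTING = 43,
    EXT_FRAGMENT = 44,
    EXT_AH = 51,
    EXT_DSTOPTS = 60
};

/* Returns the value to store in "nw_proto".  'next' is the Next Header
 * field of the base IPv6 header; 'p' points just past that 40-byte header
 * and 'len' bytes are available there. */
static uint8_t
ipv6_nw_proto(uint8_t next, const uint8_t *p, size_t len)
{
    for (;;) {
        size_t hdr_len;

        switch (next) {
        case EXT_HOPOPTS:
        case EXT_ROUTING:
        case EXT_DSTOPTS:
            if (len < 8) {
                return 0;
            }
            hdr_len = ((size_t) p[1] + 1) * 8;  /* Units of 8 octets. */
            break;
        case EXT_AH:
            if (len < 8) {
                return 0;
            }
            hdr_len = ((size_t) p[1] + 2) * 4;  /* Units of 4 octets. */
            break;
        case EXT_FRAGMENT:
            if (len < 8) {
                return 0;
            }
            /* Bytes 2-3: 13-bit fragment offset, 2 reserved bits, M flag. */
            if ((((unsigned int) p[2] << 8) | p[3]) & 0xfff8) {
                return EXT_FRAGMENT;            /* Later fragment. */
            }
            hdr_len = 8;
            break;
        default:
            return next;                        /* Terminal header. */
        }
        if (len < hdr_len) {
            return 0;                           /* Header is truncated. */
        }
        next = p[0];
        p += hdr_len;
        len -= hdr_len;
    }
}
```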
588 | ||
589 | Jumbograms | |
590 | ---------- | |
591 | ||
592 | An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer | |
593 | than 65,535 octets. A jumbogram is only relevant in subnets with a link | |
594 | MTU greater than 65,575 octets, and is not required to be supported on | |
595 | nodes that do not connect to links with such large MTUs. Currently, Open | |
596 | vSwitch doesn't process jumbograms. | |
597 | ||
598 | ||
946350dc BP |
599 | In-Band Control |
600 | =============== | |
601 | ||
56e9c3b9 BP |
602 | Motivation |
603 | ---------- | |
604 | ||
605 | An OpenFlow switch must establish and maintain a TCP network | |
606 | connection to its controller. There are two basic ways to categorize | |
607 | the network that this connection traverses: either it is completely | |
608 | separate from the one that the switch is otherwise controlling, or its | |
609 | path may overlap the network that the switch controls. We call the | |
610 | former case "out-of-band control", the latter case "in-band control". | |
611 | ||
612 | Out-of-band control has the following benefits: | |
613 | ||
614 | - Simplicity: Out-of-band control slightly simplifies the switch | |
615 | implementation. | |
616 | ||
617 | - Reliability: Excessive switch traffic volume cannot interfere | |
618 | with control traffic. | |
619 | ||
620 | - Integrity: Machines not on the control network cannot | |
621 | impersonate a switch or a controller. | |
622 | ||
623 | - Confidentiality: Machines not on the control network cannot | |
624 | snoop on control traffic. | |
625 | ||
626 | In-band control, on the other hand, has the following advantages: | |
627 | ||
628 | - No dedicated port: There is no need to dedicate a physical | |
629 | switch port to control, which is important on switches that have | |
630 | few ports (e.g. wireless routers, low-end embedded platforms). | |
631 | ||
632 | - No dedicated network: There is no need to build and maintain a | |
633 | separate control network. This is important in many | |
634 | environments because it reduces proliferation of switches and | |
635 | wiring. | |
636 | ||
637 | Open vSwitch supports both out-of-band and in-band control. This | |
638 | section describes the principles behind in-band control. See the | |
639 | description of the Controller table in ovs-vswitchd.conf.db(5) to | |
640 | configure OVS for in-band control. | |
641 | ||
642 | Principles | |
643 | ---------- | |
644 | ||
645 | The fundamental principle of in-band control is that an OpenFlow | |
646 | switch must recognize and switch control traffic without involving the | |
647 | OpenFlow controller. All the details of implementing in-band control | |
648 | are special cases of this principle. | |
649 | ||
650 | The rationale for this principle is simple. If the switch does not | |
651 | handle in-band control traffic itself, then it will be caught in a | |
652 | contradiction: it must contact the controller, but it cannot, because | |
653 | only the controller can set up the flows that are needed to contact | |
654 | the controller. | |
655 | ||
656 | The following points describe important special cases of this | |
657 | principle. | |
658 | ||
659 | - In-band control must be implemented regardless of whether the | |
660 | switch is connected. | |
661 | ||
662 | It is tempting to implement the in-band control rules only when | |
663 | the switch is not connected to the controller, using the | |
664 | reasoning that the controller should have complete control once | |
665 | it has established a connection with the switch. | |
666 | ||
667 | This does not work in practice. Consider the case where the | |
668 | switch is connected to the controller. Occasionally it can | |
669 | happen that the controller forgets or otherwise needs to obtain | |
670 | the MAC address of the switch. To do so, the controller sends a | |
671 | broadcast ARP request. A switch that implements the in-band | |
672 | control rules only when it is disconnected will then send an | |
673 | OFPT_PACKET_IN message up to the controller. The controller will | |
674 | be unable to respond, because it does not know the MAC address of | |
675 | the switch. This is a deadlock situation that can only be | |
676 | resolved by the switch noticing that its connection to the | |
677 | controller has hung and reconnecting. | |
678 | ||
679 | - In-band control must override flows set up by the controller. | |
680 | ||
681 | It is reasonable to assume that flows set up by the OpenFlow | |
682 | controller should take precedence over in-band control, on the | |
683 | basis that the controller should be in charge of the switch. | |
684 | ||
685 | Again, this does not work in practice. Reasonable controller | |
686 | implementations may set up a "last resort" fallback rule that | |
687 | wildcards every field and, e.g., sends it up to the controller or | |
688 | discards it. If a controller does that, then it will isolate | |
689 | itself from the switch. | |
690 | ||
691 | - The switch must recognize all control traffic. | |
692 | ||
693 | The fundamental principle of in-band control states, in part, | |
694 | that a switch must recognize control traffic without involving | |
695 | the OpenFlow controller. More specifically, the switch must | |
696 | recognize *all* control traffic. "False negatives", that is, | |
697 | packets that constitute control traffic but that the switch does | |
698 | not recognize as control traffic, lead to control traffic storms. | |
699 | ||
700 | Consider an OpenFlow switch that only recognizes control packets | |
701 | sent to or from that switch. Now suppose that two switches of | |
702 | this type, named A and B, are connected to ports on an Ethernet | |
703 | hub (not a switch) and that an OpenFlow controller is connected | |
704 | to a third hub port. In this setup, control traffic sent by | |
705 | switch A will be seen by switch B, which will send it to the | |
706 | controller as part of an OFPT_PACKET_IN message. Switch A will | |
707 | then see the OFPT_PACKET_IN message's packet, re-encapsulate it | |
708 | in another OFPT_PACKET_IN, and send it to the controller. Switch | |
709 | B will then see that OFPT_PACKET_IN, and so on in an infinite | |
710 | loop. | |
711 | ||
712 | Incidentally, the consequences of "false positives", where | |
713 | packets that are not control traffic are nevertheless recognized | |
714 | as control traffic, are much less severe. The controller will | |
715 | not be able to control their behavior, but the network will | |
716 | remain in working order. False positives do constitute a | |
717 | security problem. | |
718 | ||
719 | - The switch should use echo-requests to detect disconnection. | |
720 | ||
721 | TCP will notice that a connection has hung, but this can take a | |
722 | considerable amount of time. For example, with default settings | |
723 | the Linux kernel TCP implementation will retransmit for between | |
724 | 13 and 30 minutes, depending on the connection's retransmission | |
725 | timeout, according to kernel documentation. This is far too long | |
726 | for a switch to be disconnected, so an OpenFlow switch should | |
727 | implement its own connection timeout. OpenFlow OFPT_ECHO_REQUEST | |
728 | messages are the best way to do this, since they test the | |
729 | OpenFlow connection itself. | |
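
The echo-request timeout described in the last point can be sketched as a small state machine. This is only an illustration under stated assumptions, not the Open vSwitch implementation: the class name EchoProbe, its methods, and the choice of a single probe_interval for both the idle period and the reply deadline are hypothetical. Time is passed in explicitly so that the logic can be exercised without a real connection.

```python
class EchoProbe:
    """Hypothetical sketch of echo-request based disconnect detection."""

    def __init__(self, probe_interval, now):
        self.probe_interval = probe_interval  # seconds of silence tolerated
        self.last_activity = now              # time of last received message
        self.probe_sent_at = None             # set while an echo is outstanding

    def on_receive(self, now):
        # Any message from the controller proves the connection is alive.
        self.last_activity = now
        self.probe_sent_at = None

    def poll(self, now):
        # An echo request is outstanding: give up once it has gone
        # unanswered for a full probe interval.
        if self.probe_sent_at is not None:
            if now - self.probe_sent_at >= self.probe_interval:
                return "disconnect"
            return "wait"
        # The connection has been idle too long: probe it.
        if now - self.last_activity >= self.probe_interval:
            self.probe_sent_at = now
            return "send_echo_request"
        return "wait"
```

With a 5-second interval, a connection that goes silent is probed after 5 seconds and declared dead after 10, which is far shorter than waiting for TCP retransmission to give up.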
730 | ||
731 | Implementation | |
732 | -------------- | |
733 | ||
734 | This section describes how Open vSwitch implements in-band control. | |
735 | Correctly implementing in-band control has proven difficult due to its | |
736 | many subtleties, and has thus gone through many iterations. Please | |
737 | read through and understand the reasoning behind the chosen rules | |
738 | before making modifications. | |
739 | ||
740 | Open vSwitch implements in-band control as "hidden" flows, that is, | |
741 | flows that are not visible through OpenFlow, and at a higher priority | |
742 | than any wildcarded flow that can be set up through OpenFlow. This is done so | |
743 | that the OpenFlow controller cannot interfere with them and possibly | |
744 | break connectivity with its switches. It is possible to see all | |
745 | flows, including in-band ones, with the ovs-appctl "bridge/dump-flows" | |
746 | command. | |
946350dc BP |
747 | |
748 | The Open vSwitch implementation of in-band control can hide traffic to | |
749 | arbitrary "remotes", where each remote is one TCP port on one IP address. | |
750 | Currently the remotes are automatically configured as the in-band OpenFlow | |
751 | controllers plus the OVSDB managers, if any. (The latter is a requirement | |
752 | because OVSDB managers are responsible for configuring OpenFlow controllers, | |
753 | so if the manager cannot be reached then OpenFlow cannot be reconfigured.) | |
754 | ||
755 | The following rules (with the OFPP_NORMAL action) are set up on any bridge | |
756 | that has any remotes: | |
757 | ||
758 | (a) DHCP requests sent from the local port. | |
759 | (b) ARP replies to the local port's MAC address. | |
760 | (c) ARP requests from the local port's MAC address. | |
761 | ||
762 | In-band also sets up the following rules for each unique next-hop MAC | |
763 | address for the remotes' IPs (the "next hop" is either the remote | |
764 | itself, if it is on a local subnet, or the gateway to reach the remote): | |
765 | ||
766 | (d) ARP replies to the next hop's MAC address. | |
767 | (e) ARP requests from the next hop's MAC address. | |
768 | ||
769 | In-band also sets up the following rules for each unique remote IP address: | |
770 | ||
771 | (f) ARP replies containing the remote's IP address as a target. | |
772 | (g) ARP requests containing the remote's IP address as a source. | |
773 | ||
774 | In-band also sets up the following rules for each unique remote (IP,port) | |
775 | pair: | |
776 | ||
777 | (h) TCP traffic to the remote's IP and port. | |
778 | (i) TCP traffic from the remote's IP and port. | |
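
The way rules (a) through (i) multiply with the set of remotes can be sketched as follows. This is a minimal illustration, not Open vSwitch code: the Remote tuple, the in_band_rules function, and the textual match descriptions are all hypothetical, and the next-hop MAC address is assumed to be already known rather than learned through ARP.

```python
from typing import NamedTuple


class Remote(NamedTuple):
    ip: str            # remote's IP address
    port: int          # remote's TCP port
    next_hop_mac: str  # MAC of the remote itself or of the gateway toward it


def in_band_rules(local_mac, remotes):
    """Return (rule letter, match description) pairs for one bridge."""
    if not remotes:
        return []  # a bridge without remotes gets no in-band rules
    rules = [
        ("a", "DHCP requests sent from the local port"),
        ("b", f"ARP replies to {local_mac}"),
        ("c", f"ARP requests from {local_mac}"),
    ]
    # One pair of rules per unique next-hop MAC address.
    for mac in sorted({r.next_hop_mac for r in remotes}):
        rules.append(("d", f"ARP replies to {mac}"))
        rules.append(("e", f"ARP requests from {mac}"))
    # One pair of rules per unique remote IP address.
    for ip in sorted({r.ip for r in remotes}):
        rules.append(("f", f"ARP replies with target IP {ip}"))
        rules.append(("g", f"ARP requests with source IP {ip}"))
    # One pair of rules per unique remote (IP, port) pair.
    for ip, port in sorted({(r.ip, r.port) for r in remotes}):
        rules.append(("h", f"TCP to {ip}:{port}"))
        rules.append(("i", f"TCP from {ip}:{port}"))
    return rules
```

For example, an OpenFlow controller and an OVSDB manager that share one IP address and one gateway but listen on different TCP ports yield a single (d)/(e) pair and a single (f)/(g) pair, but two (h)/(i) pairs.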
779 | ||
780 | The goal of these rules is to be as narrow as possible to allow a | |
781 | switch to join a network and be able to communicate with the | |
782 | remotes. As mentioned earlier, these rules have higher priority | |
783 | than the controller's rules, so if they are too broad, they may | |
784 | prevent the controller from implementing its policy. As such, | |
785 | in-band actively monitors some aspects of flow and packet processing | |
786 | so that the rules can be made more precise. | |
787 | ||
788 | In-band control monitors attempts to add flows into the datapath that | |
789 | could interfere with its duties. The datapath only allows exact | |
790 | match entries, so in-band control is able to be very precise about | |
791 | the flows it prevents. Flows that miss in the datapath are sent to | |
792 | userspace to be processed, so preventing these flows from being | |
793 | cached in the "fast path" does not affect correctness. The only type | |
794 | of flow that is currently prevented is one that would prevent DHCP | |
795 | replies from being seen by the local port. For example, a rule that | |
796 | forwarded all DHCP traffic to the controller would not be allowed, | |
797 | but one that forwarded to all ports (including the local port) would. | |
798 | ||
799 | As mentioned earlier, packets that miss in the datapath are sent to | |
800 | the userspace for processing. The userspace has its own flow table, | |
801 | the "classifier", so in-band checks whether any special processing | |
802 | is needed before the classifier is consulted. If a packet is a DHCP | |
803 | response to a request from the local port, the packet is forwarded to | |
804 | the local port, regardless of the flow table. Note that this requires | |
805 | L7 processing of DHCP replies to determine whether the 'chaddr' field | |
806 | matches the MAC address of the local port. | |
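
The 'chaddr' check mentioned above can be sketched as follows. This is an illustration only, not the Open vSwitch code: the function name is hypothetical, and the sketch assumes the caller has already stripped the Ethernet, IP, and UDP headers and passes in just the BOOTP/DHCP payload. The field offsets follow the fixed BOOTP header layout, in which 'op' is byte 0, 'htype' byte 1, 'hlen' byte 2, and the 16-byte 'chaddr' field starts at byte 28.

```python
BOOTREPLY = 2        # 'op' value for server-to-client messages
HTYPE_ETHERNET = 1   # 'htype' value for Ethernet
CHADDR_OFFSET = 28   # start of the 16-byte 'chaddr' field
CHADDR_LEN = 16


def is_dhcp_reply_for(payload, local_mac):
    """Return True if payload is a BOOTP/DHCP reply addressed to local_mac.

    payload is the UDP payload as bytes; local_mac is the 6-byte MAC
    address of the local port.
    """
    if len(payload) < CHADDR_OFFSET + CHADDR_LEN:
        return False  # too short to contain 'chaddr'
    op, htype, hlen = payload[0], payload[1], payload[2]
    if op != BOOTREPLY or htype != HTYPE_ETHERNET or hlen != 6:
        return False
    # Only the first 'hlen' bytes of 'chaddr' carry the client's address.
    return payload[CHADDR_OFFSET:CHADDR_OFFSET + 6] == local_mac
```

A reply whose 'chaddr' names some other client returns False and is left to the normal flow table.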
807 | ||
808 | It is interesting to note that for an L3-based in-band control | |
809 | mechanism, the majority of rules are devoted to ARP traffic. At first | |
810 | glance, some of these rules appear redundant. However, each serves an | |
811 | important role. First, in order to determine the MAC address of the | |
812 | remote side (controller or gateway) for other ARP rules, we must allow | |
813 | ARP traffic for our local port with rules (b) and (c). If we are | |
814 | between a switch and its connection to the remote, we have to | |
815 | allow the other switch's ARP traffic through. This is done with | |
816 | rules (d) and (e), since we do not know the addresses of the other | |
817 | switches a priori, but do know the remote's or gateway's. Finally, | |
818 | if the remote is running in a local guest VM that is not reached | |
819 | through the local port, the switch that is connected to the VM must | |
820 | allow ARP traffic based on the remote's IP address, since it will | |
821 | not know the MAC address of the local port that is sending the traffic | |
822 | or the MAC address of the remote in the guest VM. This is done with rules (f) and (g). | |
823 | ||
824 | With a few notable exceptions below, in-band should work in most | |
825 | network setups. The following are considered "supported" in the | |
826 | current implementation: | |
827 | ||
828 | - Locally Connected. The switch and remote are on the same | |
829 | subnet. This uses rules (a), (b), (c), (h), and (i). | |
830 | ||
831 | - Reached through Gateway. The switch and remote are on | |
832 | different subnets and must go through a gateway. This uses | |
833 | rules (a), (b), (c), (h), and (i). | |
834 | ||
835 | - Between Switch and Remote. This switch is between another | |
836 | switch and the remote, and we want to allow the other | |
837 | switch's traffic through. This uses rules (d), (e), (h), and | |
838 | (i). It uses (b) and (c) indirectly in order to know the MAC | |
839 | address for rules (d) and (e). Note that DHCP for the other | |
840 | switch will not work unless an OpenFlow controller explicitly lets this | |
841 | switch pass the traffic. | |
842 | ||
843 | - Between Switch and Gateway. This switch is between another | |
844 | switch and the gateway, and we want to allow the other switch's | |
845 | traffic through. This uses the same rules and logic as the | |
846 | "Between Switch and Remote" configuration described earlier. | |
847 | ||
848 | - Remote on Local VM. The remote is a guest VM on the | |
849 | system running in-band control. This uses rules (a), (b), (c), | |
850 | (h), and (i). | |
851 | ||
852 | - Remote on Local VM with Different Networks. The remote | |
853 | is a guest VM on the system running in-band control, but the | |
854 | local port is not used to connect to the remote. For | |
855 | example, an IP address is configured on eth0 of the switch. The | |
856 | remote's VM is connected through eth1 of the switch, but an | |
857 | IP address has not been configured for that port on the switch. | |
858 | As such, the switch will use eth0 to connect to the remote, | |
859 | and eth1's rules about the local port will not work. In the | |
860 | example, the switch attached to eth0 would use rules (a), (b), | |
861 | (c), (h), and (i) on eth0. The switch attached to eth1 would use | |
862 | rules (f), (g), (h), and (i). | |
863 | ||
864 | The following are explicitly *not* supported by in-band control: | |
865 | ||
866 | - Specify Remote by Name. Currently, the remote must be | |
867 | identified by IP address. A naive approach would be to permit | |
868 | all DNS traffic. Unfortunately, this would prevent the | |
869 | controller from defining any policy over DNS. Since switches | |
870 | that are located behind us need to connect to the remote, | |
871 | in-band cannot simply add a rule that allows DNS traffic from | |
872 | the local port. The "correct" way to support this is to parse | |
873 | DNS requests to allow all traffic related to a request for the | |
874 | remote's name through. Due to the potential security | |
875 | problems and amount of processing, we decided to hold off for | |
876 | the time being. | |
877 | ||
878 | - Differing Remotes for Switches. All switches must know | |
879 | the L3 addresses for all the remotes that other switches | |
880 | may use, since rules need to be set up to allow traffic related | |
881 | to those remotes through. See rules (f), (g), (h), and (i). | |
882 | ||
883 | - Differing Routes for Switches. In order for the switch to | |
884 | allow other switches to connect to a remote through a | |
885 | gateway, it allows the gateway's traffic through with rules (d) | |
886 | and (e). If the routes to the remote differ for the two | |
887 | switches, we will not know the MAC address of the alternate | |
888 | gateway. | |
889 | ||
890 | ||
f25d0cf3 BP |
891 | Action Reproduction |
892 | =================== | |
893 | ||
894 | It seems likely that many controllers, at least at startup, use the | |
895 | OpenFlow "flow statistics" request to obtain existing flows, then | |
896 | compare the flows' actions against the actions that they expect to | |
897 | find. Before version 1.8.0, Open vSwitch always returned exact, | |
898 | byte-for-byte copies of the actions that had been added to the flow | |
899 | table. The current version of Open vSwitch does not always do this in | |
900 | some exceptional cases. This section lists the exceptions that | |
901 | controller authors must keep in mind if they compare actual actions | |
902 | against desired actions in a bytewise fashion: | |
903 | ||
904 | - Open vSwitch zeros padding bytes in action structures, | |
905 | regardless of their values when the flows were added. | |
906 | ||
d01c980f BP |
907 | - Open vSwitch "normalizes" the instructions in OpenFlow 1.1 |
908 | (and later) in the following way: | |
909 | ||
910 | * OVS sorts the instructions into the following order: | |
911 | Apply-Actions, Clear-Actions, Write-Actions, | |
912 | Write-Metadata, Goto-Table. | |
913 | ||
914 | * OVS drops Apply-Actions instructions that have empty | |
915 | action lists. | |
916 | ||
917 | * OVS drops Write-Actions instructions that have empty | |
918 | action sets. | |
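
A controller that compares instructions bytewise could apply the same normalization to its expected flows before comparing. The following is a minimal sketch under stated assumptions, not Open vSwitch code: instructions are modeled as hypothetical (name, payload) pairs, where the payload of Apply-Actions and Write-Actions is a list of actions, and the function name is made up for illustration. It does not cover the zeroing of padding bytes.

```python
# Order in which OVS reports instructions, per the list above.
INSTRUCTION_ORDER = [
    "Apply-Actions",
    "Clear-Actions",
    "Write-Actions",
    "Write-Metadata",
    "Goto-Table",
]


def normalize_instructions(instructions):
    """Drop empty Apply-Actions/Write-Actions and sort the rest."""
    kept = [
        (name, payload)
        for name, payload in instructions
        if not (name in ("Apply-Actions", "Write-Actions") and not payload)
    ]
    return sorted(kept, key=lambda ins: INSTRUCTION_ORDER.index(ins[0]))
```

Comparing normalize_instructions(expected) against the instructions returned by the switch avoids spurious mismatches caused only by ordering or by empty action lists.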
919 | ||
f25d0cf3 BP |
920 | Please report other discrepancies, if you notice any, so that we can |
921 | fix or document them. | |
922 | ||
923 | ||
d31f1109 JP |
924 | Suggestions |
925 | =========== | |
926 | ||
927 | Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org. |