Commit | Line | Data |
---|---|---|
d31f1109 JP |
1 | Design Decisions In Open vSwitch |
2 | ================================ | |
3 | ||
4 | This document describes design decisions that went into implementing | |
5 | Open vSwitch. While we believe these to be reasonable decisions, it is | |
6 | impossible to predict how Open vSwitch will be used in all environments. | |
7 | Understanding assumptions made by Open vSwitch is critical to a | |
8 | successful deployment. The end of this document contains contact | |
9 | information that can be used to let us know how we can make Open vSwitch | |
10 | more generally useful. | |
11 | ||
12 | ||
66abb12b BP |
13 | Multiple Table Support |
14 | ====================== | |
15 | ||
16 | OpenFlow 1.0 has only rudimentary support for multiple flow tables. | |
17 | Notably, OpenFlow 1.0 does not allow the controller to specify the | |
18 | flow table to which a flow is to be added. Open vSwitch adds an | |
19 | extension for this purpose, which is enabled on a per-OpenFlow | |
20 | connection basis using the NXT_FLOW_MOD_TABLE_ID message. When the | |
21 | extension is enabled, the upper 8 bits of the 'command' member in an | |
22 | OFPT_FLOW_MOD or NXT_FLOW_MOD message designate the table to which a | |
23 | flow is to be added. | |
24 | ||
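The encoding described above can be sketched in Python. This is a minimal illustration, not Open vSwitch code: the helper names pack_command and unpack_command are hypothetical, and the sketch assumes the 16-bit 'command' field keeps the OpenFlow 1.0 command (for example OFPFC_ADD = 0, OFPFC_DELETE = 3) in its lower 8 bits.

```python
OFPFC_ADD = 0      # OpenFlow 1.0 flow_mod command: add a flow
OFPFC_DELETE = 3   # OpenFlow 1.0 flow_mod command: delete matching flows


def pack_command(table_id, command):
    """Hypothetical helper: place table_id in the upper 8 bits of 'command'."""
    if not 0 <= table_id <= 0xFF:
        raise ValueError("table_id must fit in 8 bits")
    if not 0 <= command <= 0xFF:
        raise ValueError("command must fit in the lower 8 bits")
    return (table_id << 8) | command


def unpack_command(field):
    """Hypothetical helper: split a 16-bit 'command' into (table_id, command)."""
    return (field >> 8) & 0xFF, field & 0xFF


if __name__ == "__main__":
    field = pack_command(5, OFPFC_ADD)
    print(hex(field), unpack_command(field))
```

With the extension disabled, the upper 8 bits would simply be zero, so the same field layout is assumed to carry only the command in that case.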
25 | The Open vSwitch software switch implementation offers 255 flow | |
26 | tables. On packet ingress, only the first flow table (table 0) is | |
27 | searched, and the contents of the remaining tables are not considered | |
28 | in any way. Tables other than table 0 only come into play when an | |
29 | NXAST_RESUBMIT_TABLE action specifies another table to search. | |
30 | ||
31 | Tables 128 and above are reserved for use by the switch itself. | |
32 | Controllers should use only tables 0 through 127. | |
33 | ||
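The lookup behavior described above can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: the table layout (a dict mapping a table number to a priority-ordered list of flows), the action tuples, and the depth guard MAX_DEPTH are hypothetical and do not reflect the actual Open vSwitch data structures.

```python
CONTROLLER_MAX_TABLE = 127  # tables 128 and above are reserved for the switch
MAX_DEPTH = 16              # arbitrary guard against resubmit loops (assumption)


def controller_may_use(table_id):
    """True if a controller should place flows in this table."""
    return 0 <= table_id <= CONTROLLER_MAX_TABLE


def process(tables, packet, table_id=0, depth=0):
    """Return the output ports chosen for 'packet'.

    'tables' maps a table number to a list of (match_fn, actions) pairs,
    highest priority first. Only table 0 is searched on ingress; any other
    table is consulted only when a "resubmit" action names it.
    """
    if depth > MAX_DEPTH:
        return []
    outputs = []
    for match_fn, actions in tables.get(table_id, []):
        if match_fn(packet):
            for kind, arg in actions:
                if kind == "output":
                    outputs.append(arg)
                elif kind == "resubmit":
                    outputs.extend(process(tables, packet, arg, depth + 1))
            break  # only the first matching flow in a table is applied
    return outputs
```

In this model, a flow sitting in table 7 has no effect unless some flow reachable from table 0 resubmits to table 7.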
34 | ||
d31f1109 JP |
35 | IPv6 |
36 | ==== | |
37 | ||
38 | Open vSwitch supports stateless handling of IPv6 packets. Flows can be | |
39 | written to support matching TCP, UDP, and ICMPv6 headers within an IPv6 | |
685a51a5 JP |
40 | packet. Deeper matching of some Neighbor Discovery messages is also |
41 | supported. | |
d31f1109 JP |
42 | |
43 | IPv6 was not designed to interact well with middle-boxes. This, | |
44 | combined with Open vSwitch's stateless nature, has affected the | |
45 | processing of IPv6 traffic, which is detailed below. | |
46 | ||
47 | Extension Headers | |
48 | ----------------- | |
49 | ||
50 | The base IPv6 header is incredibly simple with the intention of only | |
51 | containing information relevant for routing packets between two | |
52 | endpoints. IPv6 relies heavily on the use of extension headers to | |
53 | provide any other functionality. Unfortunately, the extension headers | |
54 | were designed in such a way that it is impossible to move to the next | |
55 | header (including the layer-4 payload) unless the current header is | |
56 | understood. | |
57 | ||
58 | Open vSwitch will process the following extension headers and continue | |
59 | to the next header: | |
60 | ||
61 | * Fragment (see the next section) | |
62 | * AH (Authentication Header) | |
63 | * Hop-by-Hop Options | |
64 | * Routing | |
65 | * Destination Options | |
66 | ||
67 | When a header is encountered that is not in that list, it is considered | |
68 | "terminal". A terminal header's IPv6 protocol value is stored in | |
69 | "nw_proto" for matching purposes. If a terminal header is TCP, UDP, or | |
70 | ICMPv6, the packet will be further processed in an attempt to extract | |
71 | layer-4 information. | |
72 | ||
73 | Fragments | |
74 | --------- | |
75 | ||
76 | IPv6 requires that every link in the internet have an MTU of 1280 octets | |
77 | or greater (RFC 2460). As such, a terminal header (as described above in | |
78 | "Extension Headers") in the first fragment should generally be | |
79 | reachable. In this case, the terminal header's IPv6 protocol type is | |
80 | stored in the "nw_proto" field for matching purposes. If a terminal | |
81 | header cannot be found in the first fragment (one with a fragment offset | |
82 | of zero), the "nw_proto" field is set to 0. Subsequent fragments (those | |
83 | with a non-zero fragment offset) have the "nw_proto" field set to the | |
84 | IPv6 protocol type for fragments (44). | |
85 | ||
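The "Extension Headers" and "Fragments" rules above can be sketched as a single header walk. This is a minimal illustration under stated assumptions, not the Open vSwitch parser: the function name terminal_nw_proto is hypothetical, the input is assumed to start at the IPv6 base header, and returning 0 for any packet whose extension headers run past the end of the data (not only first fragments) is an assumption of this sketch. Header lengths follow the usual IPv6 encodings: 8-octet units plus one for Hop-by-Hop, Routing, and Destination Options, 4-octet units plus two for AH, and a fixed 8 octets for Fragment.

```python
import struct

HOP_BY_HOP, ROUTING, FRAGMENT, AH, DEST_OPTS = 0, 43, 44, 51, 60
SKIPPABLE = (HOP_BY_HOP, ROUTING, FRAGMENT, AH, DEST_OPTS)
IPV6_HEADER_LEN = 40


def terminal_nw_proto(packet):
    """Return the value a switch following the rules above would store in nw_proto."""
    if len(packet) < IPV6_HEADER_LEN:
        raise ValueError("truncated IPv6 base header")
    next_header = packet[6]          # Next Header field of the base header
    offset = IPV6_HEADER_LEN
    while next_header in SKIPPABLE:
        if len(packet) < offset + 8:
            return 0                 # terminal header not reachable
        if next_header == FRAGMENT:
            (frag_field,) = struct.unpack_from("!H", packet, offset + 2)
            if frag_field >> 3:      # non-zero fragment offset: later fragment
                return FRAGMENT
            length = 8
        elif next_header == AH:
            length = (packet[offset + 1] + 2) * 4
        else:
            length = (packet[offset + 1] + 1) * 8
        next_header = packet[offset]
        offset += length
    return next_header               # first header not in the list is terminal
```

A caller would then attempt layer-4 parsing only when the returned value is TCP (6), UDP (17), or ICMPv6 (58).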
86 | Jumbograms | |
87 | ---------- | |
88 | ||
89 | An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer | |
90 | than 65,535 octets. Jumbograms are only relevant in subnets with a link | |
91 | MTU greater than 65,575 octets, and are not required to be supported on | |
92 | nodes that do not connect to links with such large MTUs. Currently, Open | |
93 | vSwitch doesn't process jumbograms. | |
94 | ||
95 | ||
946350dc BP |
96 | In-Band Control |
97 | =============== | |
98 | ||
99 | In-band control allows a single network to be used for OpenFlow traffic and | |
100 | other data traffic. See ovs-vswitchd.conf.db(5) for a description of | |
101 | configuring in-band control. | |
102 | ||
103 | This section is an attempt to describe how in-band control works at a | |
104 | wire- and implementation-level. Correctly implementing in-band | |
105 | control has proven difficult due to its many subtleties, and has thus | |
106 | gone through many iterations. Please read through and understand the | |
107 | reasoning behind the chosen rules before making modifications. | |
108 | ||
109 | In Open vSwitch, in-band control is implemented as "hidden" flows (in that | |
110 | they are not visible through OpenFlow) at a higher priority than any | |
111 | wildcarded flow that can be set up through OpenFlow. This is done so that | |
112 | the OpenFlow controller cannot interfere with them and possibly break | |
113 | connectivity with its switches. It is possible to see all flows, including | |
114 | in-band ones, with the ovs-appctl "bridge/dump-flows" command. | |
115 | ||
116 | The Open vSwitch implementation of in-band control can hide traffic to | |
117 | arbitrary "remotes", where each remote is one TCP port on one IP address. | |
118 | Currently the remotes are automatically configured as the in-band OpenFlow | |
119 | controllers plus the OVSDB managers, if any. (The latter is a requirement | |
120 | because OVSDB managers are responsible for configuring OpenFlow controllers, | |
121 | so if the manager cannot be reached then OpenFlow cannot be reconfigured.) | |
122 | ||
123 | The following rules (with the OFPP_NORMAL action) are set up on any bridge | |
124 | that has any remotes: | |
125 | ||
126 | (a) DHCP requests sent from the local port. | |
127 | (b) ARP replies to the local port's MAC address. | |
128 | (c) ARP requests from the local port's MAC address. | |
129 | ||
130 | In-band also sets up the following rules for each unique next-hop MAC | |
131 | address for the remotes' IPs (the "next hop" is either the remote | |
132 | itself, if it is on a local subnet, or the gateway to reach the remote): | |
133 | ||
134 | (d) ARP replies to the next hop's MAC address. | |
135 | (e) ARP requests from the next hop's MAC address. | |
136 | ||
137 | In-band also sets up the following rules for each unique remote IP address: | |
138 | ||
139 | (f) ARP replies containing the remote's IP address as a target. | |
140 | (g) ARP requests containing the remote's IP address as a source. | |
141 | ||
142 | In-band also sets up the following rules for each unique remote (IP,port) | |
143 | pair: | |
144 | ||
145 | (h) TCP traffic to the remote's IP and port. | |
146 | (i) TCP traffic from the remote's IP and port. | |
147 | ||
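How rules (a) through (i) scale with the configured remotes can be sketched as follows. This is an illustration only: the function in_band_rules, its arguments, and the textual rule descriptions are hypothetical, and the real implementation installs flows rather than strings. The point is the de-duplication: rules (d) and (e) exist once per unique next-hop MAC, (f) and (g) once per unique remote IP, and (h) and (i) once per unique (IP, port) pair.

```python
def in_band_rules(local_mac, next_hop_macs, remotes):
    """Return descriptions of the in-band rules for the given remotes.

    'remotes' is an iterable of (ip, tcp_port) pairs; 'next_hop_macs' holds
    the resolved next-hop MAC for each remote (duplicates are allowed).
    """
    remotes = set(remotes)
    if not remotes:
        return []  # rules are only set up on bridges that have remotes
    rules = [
        "(a) DHCP requests sent from the local port",
        f"(b) ARP replies to {local_mac}",
        f"(c) ARP requests from {local_mac}",
    ]
    for mac in sorted(set(next_hop_macs)):
        rules.append(f"(d) ARP replies to {mac}")
        rules.append(f"(e) ARP requests from {mac}")
    for ip in sorted({ip for ip, _port in remotes}):
        rules.append(f"(f) ARP replies with target IP {ip}")
        rules.append(f"(g) ARP requests with source IP {ip}")
    for ip, port in sorted(remotes):
        rules.append(f"(h) TCP traffic to {ip}:{port}")
        rules.append(f"(i) TCP traffic from {ip}:{port}")
    return rules
```

For example, an OpenFlow controller and an OVSDB manager on the same host behind the same gateway yield one (d)/(e) pair, one (f)/(g) pair, and two (h)/(i) pairs.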
148 | The goal of these rules is to be as narrow as possible to allow a | |
149 | switch to join a network and be able to communicate with the | |
150 | remotes. As mentioned earlier, these rules have higher priority | |
151 | than the controller's rules, so if they are too broad, they may | |
152 | prevent the controller from implementing its policy. As such, | |
153 | in-band actively monitors some aspects of flow and packet processing | |
154 | so that the rules can be made more precise. | |
155 | ||
156 | In-band control monitors attempts to add flows into the datapath that | |
157 | could interfere with its duties. The datapath only allows exact | |
158 | match entries, so in-band control is able to be very precise about | |
159 | the flows it prevents. Flows that miss in the datapath are sent to | |
160 | userspace to be processed, so preventing these flows from being | |
161 | cached in the "fast path" does not affect correctness. The only type | |
162 | of flow that is currently prevented is one that would prevent DHCP | |
163 | replies from being seen by the local port. For example, a rule that | |
164 | forwarded all DHCP traffic to the controller would not be allowed, | |
165 | but one that forwarded to all ports (including the local port) would. | |
166 | ||
167 | As mentioned earlier, packets that miss in the datapath are sent to | |
168 | the userspace for processing. The userspace has its own flow table, | |
169 | the "classifier", so in-band checks whether any special processing | |
170 | is needed before the classifier is consulted. If a packet is a DHCP | |
171 | response to a request from the local port, the packet is forwarded to | |
172 | the local port, regardless of the flow table. Note that this requires | |
173 | L7 processing of DHCP replies to determine whether the 'chaddr' field | |
174 | matches the MAC address of the local port. | |
175 | ||
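The 'chaddr' check mentioned above can be sketched as follows. This is a minimal illustration, not the Open vSwitch code: the function name is hypothetical, the input is assumed to be the BOOTP/DHCP message with the Ethernet, IP, and UDP headers already stripped, and only Ethernet hardware addresses are handled. The offsets follow the fixed BOOTP header layout, in which 'op' is the first octet (2 means reply) and 'chaddr' is a 16-octet field starting at octet 28.

```python
BOOTREPLY = 2        # BOOTP 'op' value for replies
HTYPE_ETHERNET = 1   # BOOTP 'htype' value for Ethernet
ETH_ALEN = 6         # length of an Ethernet MAC address
CHADDR_OFFSET = 28   # op, htype, hlen, hops, xid, secs, flags, 4 IPv4 addresses
CHADDR_LEN = 16


def is_dhcp_reply_for(bootp, local_mac):
    """True if 'bootp' is a reply whose chaddr equals the 6-byte 'local_mac'."""
    if len(bootp) < CHADDR_OFFSET + CHADDR_LEN:
        return False                      # too short to hold chaddr
    op, htype, hlen = bootp[0], bootp[1], bootp[2]
    if op != BOOTREPLY or htype != HTYPE_ETHERNET or hlen != ETH_ALEN:
        return False
    return bootp[CHADDR_OFFSET:CHADDR_OFFSET + ETH_ALEN] == local_mac
```

A packet that passes this check would be forwarded to the local port regardless of the flow table, as described above.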
176 | It is interesting to note that for an L3-based in-band control | |
177 | mechanism, the majority of rules are devoted to ARP traffic. At first | |
178 | glance, some of these rules appear redundant. However, each serves an | |
179 | important role. First, in order to determine the MAC address of the | |
180 | remote side (controller or gateway) for other ARP rules, we must allow | |
181 | ARP traffic for our local port with rules (b) and (c). If we are | |
182 | between a switch and its connection to the remote, we have to | |
183 | allow the other switch's ARP traffic through. This is done with | |
184 | rules (d) and (e), since we do not know the addresses of the other | |
185 | switches a priori, but do know the remote's or gateway's. Finally, | |
186 | if the remote is running in a local guest VM that is not reached | |
187 | through the local port, the switch that is connected to the VM must | |
188 | allow ARP traffic based on the remote's IP address, using rules (f) | |
189 | and (g), since it will not know the MAC address of the local port that | |
190 | is sending the traffic or the MAC address of the remote in the guest VM. | |
191 | ||
192 | With a few notable exceptions below, in-band should work in most | |
193 | network setups. The following are considered "supported" in the | |
194 | current implementation: | |
195 | ||
196 | - Locally Connected. The switch and remote are on the same | |
197 | subnet. This uses rules (a), (b), (c), (h), and (i). | |
198 | ||
199 | - Reached through Gateway. The switch and remote are on | |
200 | different subnets and must go through a gateway. This uses | |
201 | rules (a), (b), (c), (h), and (i). | |
202 | ||
203 | - Between Switch and Remote. This switch is between another | |
204 | switch and the remote, and we want to allow the other | |
205 | switch's traffic through. This uses rules (d), (e), (h), and | |
206 | (i). It uses (b) and (c) indirectly in order to know the MAC | |
207 | address for rules (d) and (e). Note that DHCP for the other | |
208 | switch will not work unless an OpenFlow controller explicitly lets this | |
209 | switch pass the traffic. | |
210 | ||
211 | - Between Switch and Gateway. This switch is between another | |
212 | switch and the gateway, and we want to allow the other switch's | |
213 | traffic through. This uses the same rules and logic as the | |
214 | "Between Switch and Remote" configuration described earlier. | |
215 | ||
216 | - Remote on Local VM. The remote is a guest VM on the | |
217 | system running in-band control. This uses rules (a), (b), (c), | |
218 | (h), and (i). | |
219 | ||
220 | - Remote on Local VM with Different Networks. The remote | |
221 | is a guest VM on the system running in-band control, but the | |
222 | local port is not used to connect to the remote. For | |
223 | example, an IP address is configured on eth0 of the switch. The | |
224 | remote's VM is connected through eth1 of the switch, but an | |
225 | IP address has not been configured for that port on the switch. | |
226 | As such, the switch will use eth0 to connect to the remote, | |
227 | and eth1's rules about the local port will not work. In the | |
228 | example, the switch attached to eth0 would use rules (a), (b), | |
229 | (c), (h), and (i) on eth0. The switch attached to eth1 would use | |
230 | rules (f), (g), (h), and (i). | |
231 | ||
232 | The following are explicitly *not* supported by in-band control: | |
233 | ||
234 | - Specify Remote by Name. Currently, the remote must be | |
235 | identified by IP address. A naive approach would be to permit | |
236 | all DNS traffic. Unfortunately, this would prevent the | |
237 | controller from defining any policy over DNS. Since switches | |
238 | that are located behind us need to connect to the remote, | |
239 | in-band cannot simply add a rule that allows DNS traffic from | |
240 | the local port. The "correct" way to support this is to parse | |
241 | DNS requests to allow all traffic related to a request for the | |
242 | remote's name through. Due to the potential security | |
243 | problems and amount of processing, we decided to hold off for | |
244 | the time being. | |
245 | ||
246 | - Differing Remotes for Switches. All switches must know | |
247 | the L3 addresses for all the remotes that other switches | |
248 | may use, since rules need to be set up to allow traffic related | |
249 | to those remotes through. See rules (f), (g), (h), and (i). | |
250 | ||
251 | - Differing Routes for Switches. In order for the switch to | |
252 | allow other switches to connect to a remote through a | |
253 | gateway, it allows the gateway's traffic through with rules (d) | |
254 | and (e). If the routes to the remote differ for the two | |
255 | switches, we will not know the MAC address of the alternate | |
256 | gateway. | |
257 | ||
258 | ||
d31f1109 JP |
259 | Suggestions |
260 | =========== | |
261 | ||
262 | Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org. |