Commit | Line | Data |
---|---|---|
d31f1109 JP |
1 | Design Decisions In Open vSwitch |
2 | ================================ | |
3 | ||
4 | This document describes design decisions that went into implementing | |
5 | Open vSwitch. While we believe these to be reasonable decisions, it is | |
6 | impossible to predict how Open vSwitch will be used in all environments. | |
7 | Understanding assumptions made by Open vSwitch is critical to a | |
8 | successful deployment. The end of this document contains contact | |
9 | information that can be used to let us know how we can make Open vSwitch | |
10 | more generally useful. | |
11 | ||
12 | ||
13 | IPv6 | |
14 | ==== | |
15 | ||
16 | Open vSwitch supports stateless handling of IPv6 packets. Flows can be | |
17 | written to support matching TCP, UDP, and ICMPv6 headers within an IPv6 | |
685a51a5 JP |
18 | packet. Deeper matching of some Neighbor Discovery messages is also |
19 | supported. | |
d31f1109 JP |
20 | |
21 | IPv6 was not designed to interact well with middle-boxes. This, | |
22 | combined with Open vSwitch's stateless nature, has affected the | |
23 | processing of IPv6 traffic, which is detailed below. | |
24 | ||
25 | Extension Headers | |
26 | ----------------- | |
27 | ||
28 | The base IPv6 header is incredibly simple with the intention of only | |
29 | containing information relevant for routing packets between two | |
30 | endpoints. IPv6 relies heavily on the use of extension headers to | |
31 | provide any other functionality. Unfortunately, the extension headers | |
32 | were designed in such a way that it is impossible to move to the next | |
33 | header (including the layer-4 payload) unless the current header is | |
34 | understood. | |
35 | ||
36 | Open vSwitch will process the following extension headers and continue | |
37 | to the next header: | |
38 | ||
39 | * Fragment (see the next section) | |
40 | * AH (Authentication Header) | |
41 | * Hop-by-Hop Options | |
42 | * Routing | |
43 | * Destination Options | |
44 | ||
45 | When a header is encountered that is not in that list, it is considered | |
46 | "terminal". A terminal header's IPv6 protocol value is stored in | |
47 | "nw_proto" for matching purposes. If a terminal header is TCP, UDP, or | |
48 | ICMPv6, the packet will be further processed in an attempt to extract | |
49 | layer-4 information. | |
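The header walk described above can be sketched roughly as follows. This is an illustration only, not the Open vSwitch implementation: the function name and return shape are our own, the header-length arithmetic follows the IPv6 and AH specifications, and the special treatment of later fragments (see the next section) is not modeled here.

```python
# Illustrative sketch only; names are hypothetical, not Open vSwitch code.
HOPOPTS, ROUTING, FRAGMENT, AH, DSTOPTS = 0, 43, 44, 51, 60
SKIPPABLE = {HOPOPTS, ROUTING, FRAGMENT, AH, DSTOPTS}


def find_terminal_header(next_header, payload):
    """Walk the known extension headers and return (nw_proto, offset).

    next_header: Next Header value taken from the base IPv6 header.
    payload:     the bytes that follow the 40-octet base header.
    offset:      where the terminal header starts within payload.
    """
    proto, offset = next_header, 0
    while proto in SKIPPABLE:
        # Every extension header is at least 8 octets long.
        if offset + 8 > len(payload):
            raise ValueError("truncated extension header")
        following = payload[offset]         # Next Header field
        length_field = payload[offset + 1]  # meaning depends on header type
        if proto == FRAGMENT:
            size = 8                        # fixed size, length byte is reserved
        elif proto == AH:
            size = (length_field + 2) * 4   # AH counts 4-octet units, minus 2
        else:
            size = (length_field + 1) * 8   # 8-octet units, first 8 not counted
        proto, offset = following, offset + size
    # Anything not in SKIPPABLE is "terminal"; its value goes into nw_proto.
    return proto, offset
```

For example, a Hop-by-Hop Options header followed by a 16-octet Destination Options header and then TCP yields nw_proto 6 at offset 24, while an ESP packet (protocol 50) is terminal immediately.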
50 | ||
51 | Fragments | |
52 | --------- | |
53 | ||
54 | IPv6 requires that every link in the internet have an MTU of 1280 octets | |
55 | or greater (RFC 2460). As such, a terminal header (as described above in | |
56 | "Extension Headers") in the first fragment should generally be | |
57 | reachable. In this case, the terminal header's IPv6 protocol type is | |
58 | stored in the "nw_proto" field for matching purposes. If a terminal | |
59 | header cannot be found in the first fragment (one with a fragment offset | |
60 | of zero), the "nw_proto" field is set to 0. Subsequent fragments (those | |
61 | with a non-zero fragment offset) have the "nw_proto" field set to the | |
62 | IPv6 protocol type for fragments (44). | |
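A minimal sketch of the "nw_proto" rules above, assuming the caller has already tried to locate a terminal header in the fragment. The function names are hypothetical; the Fragment header layout (8 octets, with the offset in the upper 13 bits of the third and fourth octets, counted in 8-octet units) is taken from the IPv6 specification.

```python
import struct

IPPROTO_FRAGMENT = 44


def parse_fragment_header(header):
    """Decode the 8-octet IPv6 Fragment header.

    Returns (next_header, offset_in_octets, more_fragments, identification).
    """
    next_header, _reserved, off_flags, ident = struct.unpack("!BBHI", header[:8])
    # The upper 13 bits hold the offset in 8-octet units; masking the low
    # 3 bits gives the offset directly in octets.
    offset_octets = off_flags & 0xFFF8
    more_fragments = bool(off_flags & 0x1)
    return next_header, offset_octets, more_fragments, ident


def nw_proto_for_fragment(offset_octets, terminal_proto):
    """Pick the value stored in "nw_proto" for a fragmented packet.

    terminal_proto is the terminal header's protocol number, or None if
    no terminal header could be found in this fragment.
    """
    if offset_octets != 0:
        return IPPROTO_FRAGMENT   # later fragment
    if terminal_proto is None:
        return 0                  # first fragment, terminal header not found
    return terminal_proto         # first fragment, terminal header found
```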
63 | ||
64 | Jumbograms | |
65 | ---------- | |
66 | ||
67 | An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer | |
68 | than 65,535 octets. Jumbograms are only relevant in subnets with a link | |
69 | MTU greater than 65,575 octets, and are not required to be supported on | |
70 | nodes that do not connect to links with such large MTUs. Currently, Open | |
71 | vSwitch does not process jumbograms. | |
72 | ||
73 | ||
946350dc BP |
74 | In-Band Control |
75 | =============== | |
76 | ||
77 | In-band control allows a single network to be used for OpenFlow traffic and | |
78 | other data traffic. See ovs-vswitchd.conf.db(5) for a description of | |
79 | configuring in-band control. | |
80 | ||
81 | This section is an attempt to describe how in-band control works at a | |
82 | wire- and implementation-level. Correctly implementing in-band | |
83 | control has proven difficult due to its many subtleties, and has thus | |
84 | gone through many iterations. Please read through and understand the | |
85 | reasoning behind the chosen rules before making modifications. | |
86 | ||
87 | In Open vSwitch, in-band control is implemented as "hidden" flows (in that | |
88 | they are not visible through OpenFlow) at a higher priority than any | |
89 | wildcarded flows that can be set up through OpenFlow. This is done so that | |
90 | the OpenFlow controller cannot interfere with them and possibly break | |
91 | connectivity with its switches. It is possible to see all flows, including | |
92 | in-band ones, with the ovs-appctl "bridge/dump-flows" command. | |
93 | ||
94 | The Open vSwitch implementation of in-band control can hide traffic to | |
95 | arbitrary "remotes", where each remote is one TCP port on one IP address. | |
96 | Currently the remotes are automatically configured as the in-band OpenFlow | |
97 | controllers plus the OVSDB managers, if any. (The latter is a requirement | |
98 | because OVSDB managers are responsible for configuring OpenFlow controllers, | |
99 | so if the manager cannot be reached then OpenFlow cannot be reconfigured.) | |
100 | ||
101 | The following rules (with the OFPP_NORMAL action) are set up on any bridge | |
102 | that has any remotes: | |
103 | ||
104 | (a) DHCP requests sent from the local port. | |
105 | (b) ARP replies to the local port's MAC address. | |
106 | (c) ARP requests from the local port's MAC address. | |
107 | ||
108 | In-band also sets up the following rules for each unique next-hop MAC | |
109 | address for the remotes' IPs (the "next hop" is either the remote | |
110 | itself, if it is on a local subnet, or the gateway to reach the remote): | |
111 | ||
112 | (d) ARP replies to the next hop's MAC address. | |
113 | (e) ARP requests from the next hop's MAC address. | |
114 | ||
115 | In-band also sets up the following rules for each unique remote IP address: | |
116 | ||
117 | (f) ARP replies containing the remote's IP address as a target. | |
118 | (g) ARP requests containing the remote's IP address as a source. | |
119 | ||
120 | In-band also sets up the following rules for each unique remote (IP,port) | |
121 | pair: | |
122 | ||
123 | (h) TCP traffic to the remote's IP and port. | |
124 | (i) TCP traffic from the remote's IP and port. | |
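To make the four groups of rules concrete, the sketch below generates rules (a) through (i) from a local port MAC, a list of next-hop MACs, and a list of remotes. It is only an illustration under our own assumptions: the function name and the match-field names ("arp_op", "arp_spa", "arp_tpa", and so on) are hypothetical labels chosen for readability, not the identifiers used by Open vSwitch.

```python
# Illustrative sketch; field names are hypothetical labels, not OVS identifiers.
ETH_IP, ETH_ARP = 0x0800, 0x0806
ARP_REQUEST, ARP_REPLY = 1, 2
UDP, TCP = 17, 6
DHCP_CLIENT_PORT, DHCP_SERVER_PORT = 68, 67


def in_band_rules(local_mac, next_hop_macs, remotes):
    """Return a list of (label, match) pairs; each implies the NORMAL action.

    remotes is an iterable of (ip, tcp_port) pairs.
    """
    rules = [
        # Rules for the local port.
        ("a", {"in_port": "LOCAL", "dl_type": ETH_IP, "nw_proto": UDP,
               "dl_src": local_mac,
               "tp_src": DHCP_CLIENT_PORT, "tp_dst": DHCP_SERVER_PORT}),
        ("b", {"dl_type": ETH_ARP, "arp_op": ARP_REPLY, "dl_dst": local_mac}),
        ("c", {"dl_type": ETH_ARP, "arp_op": ARP_REQUEST, "dl_src": local_mac}),
    ]
    # Rules for each unique next-hop MAC address.
    for mac in sorted(set(next_hop_macs)):
        rules.append(("d", {"dl_type": ETH_ARP, "arp_op": ARP_REPLY, "dl_dst": mac}))
        rules.append(("e", {"dl_type": ETH_ARP, "arp_op": ARP_REQUEST, "dl_src": mac}))
    # Rules for each unique remote IP address.
    for ip in sorted({ip for ip, _port in remotes}):
        rules.append(("f", {"dl_type": ETH_ARP, "arp_op": ARP_REPLY, "arp_tpa": ip}))
        rules.append(("g", {"dl_type": ETH_ARP, "arp_op": ARP_REQUEST, "arp_spa": ip}))
    # Rules for each unique remote (IP, port) pair.
    for ip, port in sorted(set(remotes)):
        rules.append(("h", {"dl_type": ETH_IP, "nw_proto": TCP,
                            "nw_dst": ip, "tp_dst": port}))
        rules.append(("i", {"dl_type": ETH_IP, "nw_proto": TCP,
                            "nw_src": ip, "tp_src": port}))
    return rules
```

With one next hop and two remotes that share an IP address but use different TCP ports, this produces eleven rules: three for the local port, two for the next hop, two for the shared IP address, and four for the two (IP, port) pairs.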
125 | ||
126 | The goal of these rules is to be as narrow as possible to allow a | |
127 | switch to join a network and be able to communicate with the | |
128 | remotes. As mentioned earlier, these rules have higher priority | |
129 | than the controller's rules, so if they are too broad, they may | |
130 | prevent the controller from implementing its policy. As such, | |
131 | in-band actively monitors some aspects of flow and packet processing | |
132 | so that the rules can be made more precise. | |
133 | ||
134 | In-band control monitors attempts to add flows into the datapath that | |
135 | could interfere with its duties. The datapath only allows exact | |
136 | match entries, so in-band control is able to be very precise about | |
137 | the flows it prevents. Flows that miss in the datapath are sent to | |
138 | userspace to be processed, so preventing these flows from being | |
139 | cached in the "fast path" does not affect correctness. The only type | |
140 | of flow that is currently rejected is one that would prevent DHCP | |
141 | replies from being seen by the local port. For example, a rule that | |
142 | forwarded all DHCP traffic to the controller would not be allowed, | |
143 | but one that forwarded to all ports (including the local port) would. | |
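The datapath check described above might look roughly like the following. This is a sketch under our own assumptions: a flow is modeled as a plain dictionary, ports are modeled as strings, and the names "may_install" and "LOCAL" are hypothetical.

```python
# Illustrative sketch; the flow and port representations are hypothetical.
ETH_IP = 0x0800
UDP = 17
DHCP_SERVER_PORT, DHCP_CLIENT_PORT = 67, 68
LOCAL = "LOCAL"


def is_dhcp_reply(flow):
    """True if an exact-match flow describes DHCP server-to-client traffic."""
    return (flow.get("dl_type") == ETH_IP
            and flow.get("nw_proto") == UDP
            and flow.get("tp_src") == DHCP_SERVER_PORT
            and flow.get("tp_dst") == DHCP_CLIENT_PORT)


def may_install(flow, output_ports):
    """Decide whether a flow may be cached in the datapath.

    A DHCP reply flow is refused unless it also outputs to the local
    port. Refusing is safe: the packets simply keep missing in the
    datapath and are handled in userspace.
    """
    if is_dhcp_reply(flow) and LOCAL not in output_ports:
        return False
    return True
```

Here a DHCP reply flow whose only output is the controller is refused, while one that outputs to every port including the local port is accepted. Flows that are not DHCP replies are never refused by this check.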
144 | ||
145 | As mentioned earlier, packets that miss in the datapath are sent to | |
146 | userspace for processing. Userspace has its own flow table, | |
147 | the "classifier", so in-band checks whether any special processing | |
148 | is needed before the classifier is consulted. If a packet is a DHCP | |
149 | response to a request from the local port, the packet is forwarded to | |
150 | the local port, regardless of the flow table. Note that this requires | |
151 | L7 processing of DHCP replies to determine whether the 'chaddr' field | |
152 | matches the MAC address of the local port. | |
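The 'chaddr' comparison can be sketched as below, assuming the caller has already stripped the Ethernet, IP, and UDP headers and passes in the BOOTP/DHCP message itself. The function name is hypothetical; the field offsets come from the BOOTP message layout that DHCP reuses (op, htype, and hlen in the first three octets, 'chaddr' at octet 28).

```python
# Illustrative sketch; the function name is hypothetical.
BOOTREPLY = 2
HTYPE_ETHERNET = 1
ETH_ADDR_LEN = 6
CHADDR_OFFSET = 28   # op, htype, hlen, hops, xid, secs, flags, 4 addresses
CHADDR_LEN = 16


def is_dhcp_reply_for(message, local_mac):
    """True if message is a DHCP reply addressed to local_mac.

    message:   the BOOTP/DHCP message (UDP payload) as bytes.
    local_mac: the local port's MAC address as 6 raw bytes.
    """
    if len(message) < CHADDR_OFFSET + CHADDR_LEN:
        return False                      # too short to contain chaddr
    op, htype, hlen = message[0], message[1], message[2]
    if op != BOOTREPLY or htype != HTYPE_ETHERNET or hlen != ETH_ADDR_LEN:
        return False
    chaddr = message[CHADDR_OFFSET:CHADDR_OFFSET + ETH_ADDR_LEN]
    return chaddr == local_mac
```

A reply that matches would be forwarded to the local port regardless of the flow table; anything else falls through to the classifier as usual.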
153 | ||
154 | It is interesting to note that for an L3-based in-band control | |
155 | mechanism, the majority of rules are devoted to ARP traffic. At first | |
156 | glance, some of these rules appear redundant. However, each serves an | |
157 | important role. First, in order to determine the MAC address of the | |
158 | remote side (controller or gateway) for other ARP rules, we must allow | |
159 | ARP traffic for our local port with rules (b) and (c). If we are | |
160 | between a switch and its connection to the remote, we have to | |
161 | allow the other switch's ARP traffic through. This is done with | |
162 | rules (d) and (e), since we do not know the addresses of the other | |
163 | switches a priori, but do know the remote's or gateway's. Finally, | |
164 | if the remote is running in a local guest VM that is not reached | |
165 | through the local port, the switch that is connected to the VM must | |
166 | allow ARP traffic based on the remote's IP address, since it will | |
167 | not know the MAC address of the local port that is sending the traffic | |
168 | or the MAC address of the remote in the guest VM. This is done with | |
rules (f) and (g). | |
169 | ||
170 | With a few notable exceptions below, in-band should work in most | |
171 | network setups. The following are considered "supported" in the | |
172 | current implementation: | |
173 | ||
174 | - Locally Connected. The switch and remote are on the same | |
175 | subnet. This uses rules (a), (b), (c), (h), and (i). | |
176 | ||
177 | - Reached through Gateway. The switch and remote are on | |
178 | different subnets and must go through a gateway. This uses | |
179 | rules (a), (b), (c), (h), and (i). | |
180 | ||
181 | - Between Switch and Remote. This switch is between another | |
182 | switch and the remote, and we want to allow the other | |
183 | switch's traffic through. This uses rules (d), (e), (h), and | |
184 | (i). It uses (b) and (c) indirectly in order to know the MAC | |
185 | address for rules (d) and (e). Note that DHCP for the other | |
186 | switch will not work unless an OpenFlow controller explicitly lets this | |
187 | switch pass the traffic. | |
188 | ||
189 | - Between Switch and Gateway. This switch is between another | |
190 | switch and the gateway, and we want to allow the other switch's | |
191 | traffic through. This uses the same rules and logic as the | |
192 | "Between Switch and Remote" configuration described earlier. | |
193 | ||
194 | - Remote on Local VM. The remote is a guest VM on the | |
195 | system running in-band control. This uses rules (a), (b), (c), | |
196 | (h), and (i). | |
197 | ||
198 | - Remote on Local VM with Different Networks. The remote | |
199 | is a guest VM on the system running in-band control, but the | |
200 | local port is not used to connect to the remote. For | |
201 | example, an IP address is configured on eth0 of the switch. The | |
202 | remote's VM is connected through eth1 of the switch, but an | |
203 | IP address has not been configured for that port on the switch. | |
204 | As such, the switch will use eth0 to connect to the remote, | |
205 | and eth1's rules about the local port will not work. In the | |
206 | example, the switch attached to eth0 would use rules (a), (b), | |
207 | (c), (h), and (i) on eth0. The switch attached to eth1 would use | |
208 | rules (f), (g), (h), and (i). | |
209 | ||
210 | The following are explicitly *not* supported by in-band control: | |
211 | ||
212 | - Specify Remote by Name. Currently, the remote must be | |
213 | identified by IP address. A naive approach would be to permit | |
214 | all DNS traffic. Unfortunately, this would prevent the | |
215 | controller from defining any policy over DNS. Since switches | |
216 | that are located behind us need to connect to the remote, | |
217 | in-band cannot simply add a rule that allows DNS traffic from | |
218 | the local port. The "correct" way to support this is to parse | |
219 | DNS requests to allow all traffic related to a request for the | |
220 | remote's name through. Due to the potential security | |
221 | problems and amount of processing, we decided to hold off for | |
222 | the time being. | |
223 | ||
224 | - Differing Remotes for Switches. All switches must know | |
225 | the L3 addresses for all the remotes that other switches | |
226 | may use, since rules need to be set up to allow traffic related | |
227 | to those remotes through. See rules (f), (g), (h), and (i). | |
228 | ||
229 | - Differing Routes for Switches. In order for the switch to | |
230 | allow other switches to connect to a remote through a | |
231 | gateway, it allows the gateway's traffic through with rules (d) | |
232 | and (e). If the routes to the remote differ for the two | |
233 | switches, we will not know the MAC address of the alternate | |
234 | gateway. | |
235 | ||
236 | ||
d31f1109 JP |
237 | Suggestions |
238 | =========== | |
239 | ||
240 | Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org. |