Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices. It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge). Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions. The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table. If there
is a matching flow, it executes the associated actions. If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).
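
As a rough illustration of the lookup-then-upcall behavior described
above, here is a minimal Python sketch. It is a toy model, not kernel
code: the `Datapath` class, `extract_key()`, the dict-based packet, and
the `"output:2"` action string are all hypothetical names chosen for
this example.

```python
# Toy model of the datapath fast path; every name here is illustrative.

def extract_key(packet):
    """Build a hashable flow key from a packet modeled as a dict."""
    return (packet["in_port"], packet["eth_src"],
            packet["eth_dst"], packet["eth_type"])


class Datapath:
    def __init__(self):
        self.flow_table = {}   # flow key -> list of actions
        self.upcalls = []      # (key, packet) pairs queued for userspace

    def receive(self, packet):
        key = extract_key(packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # Miss: pass the packet and the parsed key up to userspace.
            self.upcalls.append((key, packet))
            return None
        # Hit: the caller would now execute these actions.
        return actions


dp = Datapath()
pkt = {"in_port": 1, "eth_src": "e0:91:f5:21:d0:b2",
       "eth_dst": "00:02:e3:0f:80:a4", "eth_type": 0x0800}

first = dp.receive(pkt)             # miss, queued to userspace
key, _ = dp.upcalls[0]
dp.flow_table[key] = ["output:2"]   # userspace installs a flow
second = dp.receive(pkt)            # hit, handled without an upcall
```

The first packet misses and is queued for userspace. Once a flow is
installed under the key that came with the upcall, the second packet is
handled entirely by the table lookup.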


Flow key compatibility
----------------------

Network protocols evolve over time. New protocols become important
and existing protocols lose their prominence. For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key. It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete. Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet. Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

  - If userspace's notion of the flow key for the packet matches the
    kernel's, then nothing special is necessary.

  - If the kernel's flow key includes more fields than the userspace
    version of the flow key, for example if the kernel decoded IPv6
    headers but userspace stopped at the Ethernet type (because it
    does not understand IPv6), then again nothing special is
    necessary. Userspace can still set up a flow in the usual way,
    as long as it uses the kernel-provided flow key to do it.

  - If the userspace flow key includes more fields than the
    kernel's, for example if userspace decoded an IPv6 header but
    the kernel stopped at the Ethernet type, then userspace can
    forward the packet manually, without setting up a flow in the
    kernel. This case is bad for performance because every packet
    that the kernel considers part of the flow must go to userspace,
    but the forwarding behavior is correct. (If userspace can
    determine that the values of the extra fields would not affect
    forwarding behavior, then it could set up a flow anyway.)
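
The three cases above reduce to one question: did userspace parse
anything that the kernel did not? The following is a minimal sketch,
assuming flow keys are modeled as Python dicts keyed by attribute name.
The function name and its return strings are hypothetical, and the
sketch does not handle differing values for attributes that both sides
parsed.

```python
def plan_flow_setup(kernel_key, user_key):
    """Decide how userspace should react to a packet sent up by the kernel.

    Both keys are dicts mapping attribute names to values. That is an
    assumption of this sketch, not the Netlink wire format.
    """
    extra_in_userspace = set(user_key) - set(kernel_key)
    if extra_in_userspace:
        # Userspace parsed more than the kernel did: a kernel flow would
        # be too coarse, so forward this packet by hand.
        return "forward-manually"
    # Keys match, or the kernel parsed more: install a flow, always
    # using the kernel-provided key.
    return "install-with-kernel-key"


eth_only = {"in_port": 1, "eth": "...", "eth_type": 0x86DD}
with_ipv6 = dict(eth_only, ipv6="...")

same = plan_flow_setup(kernel_key=eth_only, user_key=eth_only)
kernel_more = plan_flow_setup(kernel_key=with_ipv6, user_key=eth_only)
user_more = plan_flow_setup(kernel_key=eth_only, user_key=with_ipv6)
```

The sketch always takes the conservative path in the third case. It
does not model the optimization in which userspace proves that the
extra fields cannot affect forwarding and installs a flow anyway.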

How flow keys evolve over time is important to making this work, so
the following sections go into detail.


Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes. Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received. Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes. For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting. For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1:

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we ellipsize arguments not important to the discussion, e.g.:

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)


Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above, and an
optional corresponding flow mask.

A wildcarded flow can represent a group of exact match flows. Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow key.
A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit
of an incoming packet. Using a wildcarded flow can improve the flow set up rate
by reducing the number of new flows that need to be processed by the user space
program.
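
The bitwise rule above can be sketched as follows. This is only an
illustration on a single 32-bit field (an IPv4 destination address);
`masked_match()` and `ip()` are hypothetical helpers, not part of any
Open vSwitch interface.

```python
import ipaddress


def masked_match(packet_bits, key_bits, mask_bits):
    """True if the packet agrees with the key on every '1' bit of the mask."""
    return (packet_bits & mask_bits) == (key_bits & mask_bits)


def ip(text):
    """Convert dotted-quad text to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


key = ip("172.18.0.0")
mask = ip("255.255.0.0")    # upper 16 bits exact, lower 16 bits don't care

hit_a = masked_match(ip("172.18.0.52"), key, mask)
hit_b = masked_match(ip("172.18.7.9"), key, mask)
miss = masked_match(ip("172.19.0.52"), key, mask)
```

A single wildcarded flow here covers every destination in
172.18.0.0/16, where exact match flows would need one flow setup per
distinct address.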

Support for the mask Netlink attribute is optional for both the kernel and user
space program. The kernel can ignore the mask attribute, installing an exact
match flow, or reduce the number of don't care bits in the kernel to less than
what was specified by the user space program. In this case, variations in bits
that the kernel does not implement will simply result in additional flow setups.
The kernel module will also work with user space programs that neither support
nor supply flow mask attributes.

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed. There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed. This provides a handle to
identify the flow for all future operations. However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.

The behavior when using overlapping wildcarded flows is undefined. It is the
responsibility of the user space program to ensure that any incoming packet
can match at most one flow, wildcarded or not. The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them. However, this behavior may change in future versions.
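
A user space program can check for overlap itself before installing a
flow. The sketch below is one possible check on a single integer
field, not the kernel's detection logic: two flows overlap exactly when
their keys agree on every bit that both masks treat as exact match.
The name `flows_overlap()` is hypothetical.

```python
def flows_overlap(key_a, mask_a, key_b, mask_b):
    """True if some packet could match both wildcarded flows."""
    shared = mask_a & mask_b    # bits that both flows match exactly
    return ((key_a ^ key_b) & shared) == 0


wide = (0xAC120000, 0xFFFF0000)     # 172.18.0.0/16
narrow = (0xAC120500, 0xFFFFFF00)   # 172.18.5.0/24, inside the /16
other = (0xAC130000, 0xFFFF0000)    # 172.19.0.0/16

overlapping = flows_overlap(*wide, *narrow)
disjoint = flows_overlap(*wide, *other)
```

Here `wide` and `narrow` overlap because any packet to 172.18.5.x
matches both, so a program following the rule above would install only
one of them.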


Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID". UFIDs are optional
for both the kernel and user space program.

User space programs that support UFID are expected to provide it during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations. The kernel is not required to index flows by the original
flow key if a UFID is specified.


Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious:

    ------------------------------------------------------------------
    New network protocol support must only supplement existing flow
    key attributes. It must not change the meaning of already defined
    flow key attributes.
    ------------------------------------------------------------------

This rule does have less-obvious consequences so it is worth working
through a few examples. Suppose, for example, that the kernel module
did not already implement VLAN parsing. Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet. The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata:

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions. With this change, a TCP packet in VLAN 10 would have a
flow key much like this:

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets. This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.

The solution is to use a set of nested attributes. This is, for
example, why 802.1Q support uses nested attributes. A TCP packet in
VLAN 10 is actually expressed as:

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute. Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them. (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)
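
To make the difference concrete, the sketch below models an application
that predates VLAN support and silently skips attributes it does not
know. The list-of-pairs key representation, the `KNOWN` set, and
`legacy_view()` are assumptions of this example, not real interfaces.

```python
# Attributes understood by a hypothetical pre-VLAN application.
KNOWN = {"eth", "eth_type", "ip", "tcp"}


def legacy_view(flow_key):
    """Return the top-level attributes such an application would act on."""
    return {name: value for name, value in flow_key if name in KNOWN}


# Naive encoding: "vlan" added next to the existing attributes.
flat = [("eth", "..."), ("vlan", {"vid": 10, "pcp": 0}),
        ("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]

# Nested encoding: inner headers hidden inside "encap".
nested = [("eth", "..."), ("eth_type", 0x8100),
          ("vlan", {"vid": 10, "pcp": 0}),
          ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}),
                     ("tcp", "...")])]

seen_flat = legacy_view(flat)
seen_nested = legacy_view(nested)
```

With the flat encoding the old application sees `eth_type` 0x0800 plus
`ip` and `tcp`, and wrongly concludes that the flow is untagged IP.
With the nested encoding it sees only `eth` and `eth_type` 0x8100,
which is exactly what it saw before VLAN parsing existed.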

Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc. This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always have the kernel forward all packets with those values to
userspace and handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing. The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this:

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type. The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this:

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.
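
The role of the CFI bit can be sketched as below. The helper names are
hypothetical, and the constant assumes the usual 802.1Q TCI layout in
which the CFI bit is 0x1000. A truncated packet is modeled by passing
`None` as the TCI.

```python
VLAN_TAG_PRESENT = 0x1000   # CFI bit within the 16-bit TCI (assumed layout)


def vlan_attr_from_tci(tci):
    """Value to place in the vlan attribute for a parsed TCI.

    `tci` is None when the packet was truncated before the TCI.
    """
    if tci is None:
        return 0                    # "empty" content: all-zero bits
    return tci | VLAN_TAG_PRESENT   # mark that a TCI was really there


def tci_is_missing(vlan_attr):
    """True if the vlan attribute denotes a missing or malformed TCI."""
    return (vlan_attr & VLAN_TAG_PRESENT) == 0


truncated = vlan_attr_from_tci(None)
zero_tci = vlan_attr_from_tci(0x0000)   # valid packet, VID 0, PCP 0
```

Because the bit is set for every TCI that was actually parsed, a valid
all-zero TCI yields 0x1000 while a truncated packet yields 0, so the
two cases never collide.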

Other rules
-----------

The other rules for flow keys are much less subtle:

  - Duplicate attributes are not allowed at a given nesting level.

  - Ordering of attributes is not significant.

  - When the kernel sends a given flow key to userspace, it always
    composes it the same way. This allows userspace to hash and
    compare entire flow keys that it may not be able to fully
    interpret.
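
Two of these rules lend themselves to a short sketch: checking for
duplicates per nesting level, and treating a kernel-provided key as an
opaque byte string for hashing. The list-of-pairs representation and
the names `has_duplicates()` and `key_digest()` are assumptions of this
example. The choice of SHA-256 is arbitrary.

```python
import hashlib


def has_duplicates(flow_key):
    """True if any attribute name repeats within one nesting level."""
    names = [name for name, _ in flow_key]
    if len(names) != len(set(names)):
        return True
    # Nested attributes are modeled as lists; check each level on its own.
    return any(isinstance(value, list) and has_duplicates(value)
               for _, value in flow_key)


def key_digest(raw_key):
    """Hash a key's raw bytes without interpreting them."""
    return hashlib.sha256(raw_key).hexdigest()


nested = [("eth", "..."), ("eth_type", 0x8100), ("vlan", 0x100A),
          ("encap", [("eth_type", 0x0800), ("ip", "..."), ("tcp", "...")])]
bad = [("eth_type", 0x0800), ("eth_type", 0x86DD)]

nested_ok = not has_duplicates(nested)
bad_rejected = has_duplicates(bad)
same_digest = key_digest(b"\x01\x02") == key_digest(b"\x01\x02")
```

Note that `eth_type` appearing once at the top level and once inside
`encap` is not a duplicate, because the two occurrences sit at
different nesting levels.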


Coding rules
============

Compatibility
-------------

Please implement the headers and code for compatibility with older kernels
in the linux/compat/ directory. All public functions should be exported using
the EXPORT_SYMBOL macro. A public function that replaces the same-named kernel
function should be prefixed with 'rpl_'. Otherwise, the function should be
prefixed with 'ovs_'. For special cases where it is not possible to follow
this rule (e.g., the pskb_expand_head() function), the function name must
be added to linux/compat/build-aux/export-check-whitelist; otherwise, the
compilation check 'check-export-symbol' will fail.