Using Open vSwitch with DPDK
============================

Open vSwitch can use the Intel(R) DPDK library to operate entirely in
userspace. This file explains how to install and use Open vSwitch in
such a mode.

The DPDK support of Open vSwitch is considered experimental.
It has not been thoroughly tested.

This version of Open vSwitch should be built manually with `configure`
and `make`.

OVS needs a system with 1GB hugepages support.

Building and Installing:
------------------------

Required: DPDK 2.1
Optional (if building with vhost-cuse): `fuse`, `fuse-devel` (`libfuse-dev`
on Debian/Ubuntu)

1. Configure build & install DPDK:
   1. Set `$DPDK_DIR`

      ```
      export DPDK_DIR=/usr/src/dpdk-2.1
      cd $DPDK_DIR
      ```

   2. Update `config/common_linuxapp` so that DPDK generates a single library
      file. (This modification is also required for the IVSHMEM build.)

      `CONFIG_RTE_BUILD_COMBINE_LIBS=y`

      Then run `make install` to build and install the library.
      For a default install without IVSHMEM:

      `make install T=x86_64-native-linuxapp-gcc`

      To include IVSHMEM (shared memory):

      `make install T=x86_64-ivshmem-linuxapp-gcc`

      For further details refer to http://dpdk.org/

2. Configure & build the Linux kernel:

   Refer to intel-dpdk-getting-started-guide.pdf to understand the DPDK
   kernel requirements.

3. Configure & build OVS:

   * Non-IVSHMEM:

     `export DPDK_BUILD=$DPDK_DIR/x86_64-native-linuxapp-gcc/`

   * IVSHMEM:

     `export DPDK_BUILD=$DPDK_DIR/x86_64-ivshmem-linuxapp-gcc/`

   ```
   cd $OVS_DIR/
   ./boot.sh
   ./configure --with-dpdk=$DPDK_BUILD [CFLAGS="-g -O2 -Wno-cast-align"]
   make
   ```

   Note: 'clang' users may specify the '-Wno-cast-align' flag to suppress
   DPDK cast-align warnings.

   For better performance one can enable aggressive compiler optimizations
   and use special instructions (popcnt, crc32) that may not be available
   on all machines. Instead of typing `make`, type:

   `make CFLAGS='-O3 -march=native'`

Refer to [INSTALL.userspace.md] for general requirements of building userspace OVS.

Using the DPDK with ovs-vswitchd:
---------------------------------

1. Setup system boot
   Add the following options to the kernel bootline:

   `default_hugepagesz=1GB hugepagesz=1G hugepages=1`

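   After rebooting, it can be useful to confirm that the hugepages were
   actually reserved before going any further. A minimal check, assuming the
   standard procfs layout:

   ```
   grep -i huge /proc/meminfo
   cat /proc/cmdline
   ```
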
2. Setup DPDK devices:

   DPDK devices can be set up using either the VFIO (for DPDK 1.7+) or UIO
   modules. UIO requires inserting an out-of-tree driver, igb_uio.ko, that
   is available in DPDK. Setup for both methods is described below.

   * UIO:
     1. Insert uio.ko: `modprobe uio`
     2. Insert igb_uio.ko: `insmod $DPDK_BUILD/kmod/igb_uio.ko`
     3. Bind the network device to igb_uio:
        `$DPDK_DIR/tools/dpdk_nic_bind.py --bind=igb_uio eth1`

   * VFIO:

     VFIO needs to be supported in the kernel and the BIOS. More information
     can be found in the [DPDK Linux GSG].

     1. Insert vfio-pci.ko: `modprobe vfio-pci`
     2. Set correct permissions on the vfio device: `sudo /usr/bin/chmod a+x /dev/vfio`
        and: `sudo /usr/bin/chmod 0666 /dev/vfio/*`
     3. Bind the network device to vfio-pci:
        `$DPDK_DIR/tools/dpdk_nic_bind.py --bind=vfio-pci eth1`

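   Whichever method is used, the binding can be verified before starting
   vswitchd; the same bind script reports which devices are using
   DPDK-compatible drivers:

   ```
   $DPDK_DIR/tools/dpdk_nic_bind.py --status
   ```
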
3. Mount the hugetlbfs filesystem

   `mount -t hugetlbfs -o pagesize=1G none /dev/hugepages`

   Refer to http://www.dpdk.org/doc/quick-start for verifying the DPDK setup.

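   To make the mount persistent across reboots, an entry along the following
   lines can be added to `/etc/fstab` (the mount point and page size simply
   mirror the command above):

   ```
   nodev /dev/hugepages hugetlbfs pagesize=1G 0 0
   ```
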
4. Follow the instructions in [INSTALL.md] to install only the
   userspace daemons and utilities (via 'make install').
   1. First time only: create (or clear) the database:

      ```
      mkdir -p /usr/local/etc/openvswitch
      mkdir -p /usr/local/var/run/openvswitch
      rm /usr/local/etc/openvswitch/conf.db
      ovsdb-tool create /usr/local/etc/openvswitch/conf.db \
          /usr/local/share/openvswitch/vswitch.ovsschema
      ```

   2. Start ovsdb-server

      ```
      ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock \
          --remote=db:Open_vSwitch,Open_vSwitch,manager_options \
          --private-key=db:Open_vSwitch,SSL,private_key \
          --certificate=db:Open_vSwitch,SSL,certificate \
          --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --pidfile --detach
      ```

   3. First time after db creation, initialize:

      ```
      ovs-vsctl --no-wait init
      ```

5. Start vswitchd:

   DPDK configuration arguments can be passed to vswitchd via the `--dpdk`
   argument. This needs to be the first argument passed to the vswitchd
   process. The DPDK argument `-c` (coremask) is ignored by ovs-dpdk, but it
   is a required parameter for DPDK initialization.

   ```
   export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
   ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach
   ```

   If more than one GB hugepage is allocated (as for IVSHMEM), set the
   amount and use NUMA node 0 memory:

   ```
   ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 \
       -- unix:$DB_SOCK --pidfile --detach
   ```

6. Add bridge & ports

   To use ovs-vswitchd with DPDK, create a bridge with datapath_type
   "netdev" in the configuration database. For example:

   `ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev`

   Now you can add dpdk devices. OVS expects DPDK device names to start with
   "dpdk" and end with a portid. vswitchd should print (in the log file) the
   number of dpdk devices found.

   ```
   ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
   ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk
   ```

   Once the first DPDK port is added to vswitchd, it creates a polling thread
   and polls the dpdk device in a continuous loop. Therefore CPU utilization
   for that thread is always 100%.

   Note: creating bonds of DPDK interfaces is slightly different from creating
   bonds of system interfaces. For DPDK, the interface type must be explicitly
   set, for example:

   ```
   ovs-vsctl add-bond br0 dpdkbond dpdk0 dpdk1 \
       -- set Interface dpdk0 type=dpdk \
       -- set Interface dpdk1 type=dpdk
   ```

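   To confirm that the ports were created as expected, the bridge can be
   inspected and the tail of the vswitchd log checked (the log path below
   assumes a default --prefix=/usr/local install):

   ```
   ovs-vsctl show
   tail -n 20 /usr/local/var/log/openvswitch/ovs-vswitchd.log
   ```
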
7. Add test flows

   Test flow script across NICs (assuming ovs is in /usr/src/ovs):
   Execute script:

   ```
   #! /bin/sh
   # Move to command directory
   cd /usr/src/ovs/utilities/

   # Clear current flows
   ./ovs-ofctl del-flows br0

   # Add flows between port 1 (dpdk0) and port 2 (dpdk1)
   ./ovs-ofctl add-flow br0 in_port=1,action=output:2
   ./ovs-ofctl add-flow br0 in_port=2,action=output:1
   ```

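   While traffic is running, the installed flows and their packet counters
   can be dumped to confirm that the test flows are actually being hit:

   ```
   cd /usr/src/ovs/utilities/
   ./ovs-ofctl dump-flows br0
   ```
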
Performance Tuning:
-------------------

1. PMD affinitization

   A poll mode driver (pmd) thread handles the I/O of all DPDK
   interfaces assigned to it. A pmd thread will busy-loop through
   the assigned ports/rxqs, polling for packets, switching the packets
   and sending them to a tx port if required. Typically, it is found that
   a pmd thread is CPU bound, meaning that the greater the CPU
   occupancy the pmd thread can get, the better the performance. To
   that end, it is good practice to ensure that a pmd thread has as
   many cycles on a core available to it as possible. This can be
   achieved by affinitizing the pmd thread with a core that has no
   other workload. See section 7 below for a description of how to
   isolate cores for this purpose also.

   The following command can be used to specify the affinity of the
   pmd thread(s).

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=<hex string>`

   By setting a bit in the mask, a pmd thread is created and pinned
   to the corresponding CPU core. e.g. to run a pmd thread on core 1:

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=2`

   For more information, please refer to the Open_vSwitch TABLE section in

   `man ovs-vswitchd.conf.db`

   Note that a pmd thread on a NUMA node is only created if there is
   at least one DPDK interface from that NUMA node added to OVS.

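   For masks covering higher core numbers it can be easier to compute the
   hex string than to write it by hand. A small sketch (the core numbers
   are illustrative):

   ```
   # mask for cores 1 and 2: (1 << 1) | (1 << 2) = 0x6
   printf 'pmd-cpu-mask=%x\n' $(( (1 << 1) | (1 << 2) ))
   ```
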
2. Multiple poll mode driver threads

   With pmd multi-threading support, OVS creates one pmd thread
   for each NUMA node by default. However, it can be seen that in cases
   where there are multiple ports/rxqs producing traffic, performance
   can be improved by creating multiple pmd threads running on separate
   cores. These pmd threads can then share the workload by each being
   responsible for different ports/rxqs. Assignment of ports/rxqs to
   pmd threads is done automatically.

   The following command can be used to specify the affinity of the
   pmd threads.

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=<hex string>`

   A set bit in the mask means a pmd thread is created and pinned
   to the corresponding CPU core. e.g. to run pmd threads on cores 1 and 2:

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6`

   For more information, please refer to the Open_vSwitch TABLE section in

   `man ovs-vswitchd.conf.db`

   For example, when using dpdk and dpdkvhostuser ports in a bi-directional
   VM loopback as shown below, spreading the workload over 2 or 4 pmd
   threads shows significant improvements as there will be more total CPU
   occupancy available.

   NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1

   The OVS log can be checked to confirm that the port/rxq assignment to
   pmd threads is as required. This can also be checked with the following
   commands:

   ```
   top -H
   taskset -p <pid_of_pmd>
   ```

   To understand where most of the pmd thread time is spent and whether the
   caches are being utilized, these commands can be used:

   ```
   # Clear previous stats
   ovs-appctl dpif-netdev/pmd-stats-clear

   # Check current stats
   ovs-appctl dpif-netdev/pmd-stats-show
   ```

3. DPDK port Rx Queues

   `ovs-vsctl set Open_vSwitch . other_config:n-dpdk-rxqs=<integer>`

   The command above sets the number of rx queues for each DPDK interface.
   The rx queues are assigned to pmd threads on the same NUMA node in a
   round-robin fashion. For more information, please refer to the
   Open_vSwitch TABLE section in

   `man ovs-vswitchd.conf.db`

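   As a concrete usage example (the value 4 is purely illustrative, and extra
   queues generally only help when there are enough pmd threads to service
   them):

   ```
   ovs-vsctl set Open_vSwitch . other_config:n-dpdk-rxqs=4
   ```
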
4. Exact Match Cache

   Each pmd thread contains one EMC. After initial flow setup in the
   datapath, the EMC contains a single table and provides the lowest level
   (fastest) switching for DPDK ports. If there is a miss in the EMC then
   the next level where switching will occur is the datapath classifier.
   Missing in the EMC and looking up in the datapath classifier incurs a
   significant performance penalty. If lookup misses occur in the EMC
   because it is too small to handle the number of flows, its size can
   be increased. The EMC size can be modified by editing the define
   EM_FLOW_HASH_SHIFT in lib/dpif-netdev.c.

   As mentioned above, an EMC is per pmd thread, so an alternative way of
   increasing the aggregate amount of possible flow entries in the EMC and
   avoiding datapath classifier lookups is to have multiple pmd threads
   running. This can be done as described in section 2.

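   Because the size is a compile-time constant, changing it means editing
   the source and rebuilding OVS. A minimal sketch of locating the define
   (the value to use is workload-dependent and not prescribed here):

   ```
   grep -n 'define EM_FLOW_HASH_SHIFT' lib/dpif-netdev.c
   # edit the value, then rebuild and reinstall OVS
   make && make install
   ```
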
5. Compiler options

   The default compiler optimization level is '-O2'. Changing this to
   more aggressive compiler optimizations such as '-O3' or
   '-Ofast -march=native' with gcc can produce performance gains.

6. Simultaneous Multithreading (SMT)

   With SMT enabled, one physical core appears as two logical cores
   which can improve performance.

   SMT can be utilized to add additional pmd threads without consuming
   additional physical cores. Additional pmd threads may be added in the
   same manner as described in section 2. If trying to minimize the use
   of physical cores for pmd threads, care must be taken to set the
   correct bits in the pmd-cpu-mask to ensure that the pmd threads are
   pinned to SMT siblings.

   For example, when using 2x 10-core processors in a dual socket system
   with HT enabled, /proc/cpuinfo will report 40 logical cores. To use
   two logical cores which share the same physical core for pmd threads,
   the following command can be used to identify a pair of logical cores.

   `cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list`

   where N is the logical core number. In this example, it would show that
   cores 1 and 21 share the same physical core. The pmd-cpu-mask to enable
   two pmd threads running on these two logical cores (one physical core)
   is:

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=100002`

   Note that SMT is enabled by the Hyper-Threading section in the
   BIOS, and as such will apply to the whole system. So the impact of
   enabling/disabling it for the whole system should be considered,
   e.g. if workloads on the system can scale across multiple cores,
   SMT may be very beneficial. However, if they do not and perform best
   on a single physical core, SMT may not be beneficial.

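   To see the sibling pairs for every logical core at once, a small loop
   over sysfs can be used (the output format may vary slightly between
   kernel versions):

   ```
   for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
       echo "$(basename $cpu): $(cat $cpu/topology/thread_siblings_list)"
   done
   ```
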
7. The isolcpus kernel boot parameter

   isolcpus can be used on the kernel bootline to isolate cores from the
   kernel scheduler and hence dedicate them to OVS or other packet
   forwarding related workloads. For example, a Linux kernel boot-line
   could be:

   `GRUB_CMDLINE_LINUX_DEFAULT="quiet hugepagesz=1G hugepages=4 default_hugepagesz=1G intel_iommu=off isolcpus=1-19"`

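   After editing the GRUB configuration, the bootloader configuration has to
   be regenerated and the host rebooted before the isolation takes effect
   (the commands below assume a Debian/Ubuntu-style GRUB setup):

   ```
   sudo update-grub
   sudo reboot
   # after the reboot, confirm the parameters took effect
   cat /proc/cmdline
   ```
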
8. NUMA/Cluster On Die

   Ideally inter-NUMA datapaths should be avoided where possible as packets
   will go across QPI and there may be a slight performance penalty when
   compared with intra-NUMA datapaths. On Intel Xeon Processor E5 v3,
   Cluster On Die is introduced on models that have 10 cores or more.
   This makes it possible to logically split a socket into two NUMA regions
   and again it is preferred where possible to keep critical datapaths
   within the one cluster.

   It is good practice to ensure that threads that are in the datapath are
   pinned to cores in the same NUMA area, e.g. pmd threads and QEMU vCPUs
   responsible for forwarding.

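   The NUMA layout of the host (which cores and how much memory belong to
   which node) can be checked with standard tools, for example:

   ```
   lscpu
   numactl --hardware
   ```
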
9. Rx Mergeable buffers

   Rx mergeable buffers is a virtio feature that allows chaining of multiple
   virtio descriptors to handle large packet sizes. As such, large packets
   are handled by reserving and chaining multiple free descriptors
   together. Mergeable buffer support is negotiated between the virtio
   driver and virtio device and is supported by the DPDK vhost library.
   This behavior is typically supported and enabled by default, however
   in the case where the user knows that rx mergeable buffers are not needed
   i.e. jumbo frames are not needed, it can be forced off by adding
   mrg_rxbuf=off to the QEMU command line options. By not reserving multiple
   chains of descriptors it will make more individual virtio descriptors
   available for rx to the guest using dpdkvhost ports and this can improve
   performance.

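   As an illustration, the device line from the vhost-user example earlier
   would become the following (only the extra mrg_rxbuf property is new):

   ```
   -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=off
   ```
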
10. Packet processing in the guest

    Whether simply forwarding packets from one interface to another or
    performing more complex packet processing in the guest, it is good
    practice to ensure that the thread performing this work has as much CPU
    occupancy as possible. For example, when the DPDK sample application
    `testpmd` is used to forward packets in the guest, multiple QEMU vCPU
    threads can be created. Taskset can then be used to affinitize the
    vCPU thread responsible for forwarding to a dedicated core not used
    for other general processing on the host system.

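    On the host, the QEMU vCPU threads can be listed and then pinned with
    taskset; the core number and thread id below are placeholders:

    ```
    # list QEMU threads (lwp column) and the core they last ran on (psr)
    ps -eLo pid,lwp,psr,comm | grep qemu
    # pin the forwarding vCPU thread to a dedicated core, e.g. core 4
    taskset -pc 4 <lwp_of_vcpu_thread>
    ```
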
11. DPDK virtio pmd in the guest

    dpdkvhostcuse or dpdkvhostuser ports can be used to accelerate the path
    to the guest using the DPDK vhost library. This library is compatible
    with virtio-net drivers in the guest but significantly better performance
    can be observed when using the DPDK virtio pmd driver in the guest. The
    DPDK `testpmd` application can be used in the guest as an example
    application that forwards packets from one DPDK vhost port to another. An
    example of running `testpmd` in the guest can be seen here.

    `./testpmd -c 0x3 -n 4 --socket-mem 512 -- --burst=64 -i --txqflags=0xf00 --disable-hw-vlan --forward-mode=io --auto-start`

    See below for information on dpdkvhostcuse and dpdkvhostuser ports.
    See [DPDK Docs] for more information on `testpmd`.

DPDK Rings:
-----------

Following the steps above to create a bridge, you can now add dpdk rings
as a port to the vswitch. OVS will expect the DPDK ring device name to
start with dpdkr and end with a portid.

`ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr`

DPDK rings client test application

Included in the test directory is a sample DPDK application for testing
the rings. This is from the base dpdk directory and modified to work
with the ring naming used within ovs.

Location: tests/ovs_client

To run the client:

```
cd /usr/src/ovs/tests/
ovsclient -c 1 -n 4 --proc-type=secondary -- -n "port id you gave dpdkr"
```

In the case of the dpdkr example above, the "port id you gave dpdkr" is 0.

It is essential to have `--proc-type=secondary`.

The application simply receives an mbuf on the receive queue of the
ethernet ring and then places that same mbuf on the transmit ring of
the ethernet ring. It is a trivial loopback application.

DPDK rings in VM (IVSHMEM shared memory communications)
--------------------------------------------------------

In addition to executing the client in the host, you can execute it within
a guest VM. To do so you will need a patched qemu. You can download the
patch and getting started guide at:

https://01.org/packet-processing/downloads

A general rule of thumb for better performance is that the client
application should not be assigned the same dpdk core mask "-c" as
the vswitchd.

DPDK vhost:
-----------

DPDK 2.1 supports two types of vhost:

1. vhost-user
2. vhost-cuse

Whichever type of vhost is enabled in the specified DPDK build is the type
that will be enabled in OVS. By default, vhost-user is enabled in DPDK.
Therefore, unless vhost-cuse has been enabled in DPDK, vhost-user ports
will be enabled in OVS.
Please note that support for vhost-cuse is intended to be deprecated in OVS
in a future release.

DPDK vhost-user:
----------------

The following sections describe the use of vhost-user 'dpdkvhostuser' ports
with OVS.

DPDK vhost-user Prerequisites:
------------------------------

1. DPDK 2.1 with vhost support enabled as documented in the "Building and
   Installing" section

2. QEMU version v2.1.0+

   QEMU v2.1.0 will suffice, but it is recommended to use v2.2.0 if providing
   your VM with memory greater than 1GB due to potential issues with memory
   mapping larger areas.

Adding DPDK vhost-user ports to the Switch:
-------------------------------------------

Following the steps above to create a bridge, you can now add DPDK vhost-user
as a port to the vswitch. Unlike DPDK ring ports, DPDK vhost-user ports can
have arbitrary names.

- For vhost-user, the name of the port type is `dpdkvhostuser`

  ```
  ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1 \
      type=dpdkvhostuser
  ```

  This action creates a socket located at
  `/usr/local/var/run/openvswitch/vhost-user-1`, which you must provide
  to your VM on the QEMU command line. More instructions on this can be
  found in the next section "DPDK vhost-user VM configuration".
  Note: If you wish for the vhost-user sockets to be created in a
  directory other than `/usr/local/var/run/openvswitch`, you may specify
  another location on the ovs-vswitchd command line like so:

  `./vswitchd/ovs-vswitchd --dpdk -vhost_sock_dir /my-dir -c 0x1 ...`

DPDK vhost-user VM configuration:
---------------------------------
Follow the steps below to attach vhost-user port(s) to a VM.

1. Configure sockets.
   Pass the following parameters to QEMU to attach a vhost-user device:

   ```
   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1
   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
   -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1
   ```

   ...where vhost-user-1 is the name of the vhost-user port added
   to the switch.
   Repeat the above parameters for multiple devices, changing the
   chardev path and id as necessary. Note that a separate and different
   chardev path needs to be specified for each vhost-user device. For
   example, if you have a second vhost-user port named 'vhost-user-2', you
   append your QEMU command line with an additional set of parameters:

   ```
   -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
   -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce
   -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2
   ```

2. Configure huge pages.
   QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports access
   a virtio-net device's virtual rings and packet buffers mapping the VM's
   physical memory on hugetlbfs. To enable vhost-user ports to map the VM's
   memory into their process address space, pass the following parameters
   to QEMU:

   ```
   -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on
   -numa node,memdev=mem -mem-prealloc
   ```

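Putting the two steps together, a complete (purely illustrative) QEMU
invocation for a single vhost-user port might look like the following; the
disk image path, memory size and MAC address are placeholders:

```
qemu-system-x86_64 -m 4096 -smp 2 -enable-kvm -cpu host \
    -drive file=/path/to/guest.img \
    -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem -mem-prealloc \
    -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
    -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce \
    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1
```
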
DPDK vhost-cuse:
----------------

The following sections describe the use of vhost-cuse 'dpdkvhostcuse' ports
with OVS.

DPDK vhost-cuse Prerequisites:
------------------------------

1. DPDK 2.1 with vhost support enabled as documented in the "Building and
   Installing" section.
   As an additional step, you must enable vhost-cuse in DPDK by setting the
   following additional flag in `config/common_linuxapp`:

   `CONFIG_RTE_LIBRTE_VHOST_USER=n`

   Following this, rebuild DPDK as per the instructions in the "Building and
   Installing" section. Finally, rebuild OVS as per step 3 in the "Building
   and Installing" section; OVS will detect that DPDK has vhost-cuse libraries
   compiled and in turn will enable support for it in the switch and disable
   vhost-user support.

2. Insert the Cuse module:

   `modprobe cuse`

3. Build and insert the `eventfd_link` module:

   ```
   cd $DPDK_DIR/lib/librte_vhost/eventfd_link/
   make
   insmod ./eventfd_link.ko
   ```

4. QEMU version v2.1.0+

   vhost-cuse will work with QEMU v2.1.0 and above, however it is recommended
   to use v2.2.0 if providing your VM with memory greater than 1GB due to
   potential issues with memory mapping larger areas.
   Note: QEMU v1.6.2 will also work, with slightly different command line
   parameters, which are specified later in this document.

Adding DPDK vhost-cuse ports to the Switch:
-------------------------------------------

Following the steps above to create a bridge, you can now add DPDK vhost-cuse
as a port to the vswitch. Unlike DPDK ring ports, DPDK vhost-cuse ports can
have arbitrary names.

- For vhost-cuse, the name of the port type is `dpdkvhostcuse`

  ```
  ovs-vsctl add-port br0 vhost-cuse-1 -- set Interface vhost-cuse-1 \
      type=dpdkvhostcuse
  ```

  When attaching vhost-cuse ports to QEMU, the name provided during the
  add-port operation must match the ifname parameter on the QEMU command
  line. More instructions on this can be found in the next section.

DPDK vhost-cuse VM configuration:
---------------------------------

vhost-cuse ports use a Linux* character device to communicate with QEMU.
By default it is set to `/dev/vhost-net`. It is possible to reuse this
standard device for DPDK vhost, which makes setup a little simpler, but it
is better practice to specify an alternative character device in order to
avoid any conflicts if kernel vhost is to be used in parallel.

1. This step is only needed if using an alternative character device.

   The new character device filename must be specified on the vswitchd
   commandline:

   `./vswitchd/ovs-vswitchd --dpdk --cuse_dev_name my-vhost-net -c 0x1 ...`

   Note that the `--cuse_dev_name` argument and associated string must be the
   first arguments after `--dpdk` and come before the EAL arguments. In the
   example above, the character device to be used will be `/dev/my-vhost-net`.

2. This step is only needed if reusing the standard character device. It will
   conflict with the kernel vhost character device so the user must first
   remove it.

   `rm -rf /dev/vhost-net`

3a. Configure virtio-net adaptors:
   The following parameters must be passed to the QEMU binary:

   ```
   -netdev tap,id=<id>,script=no,downscript=no,ifname=<name>,vhost=on
   -device virtio-net-pci,netdev=net1,mac=<mac>
   ```

   Repeat the above parameters for multiple devices.

   The DPDK vhost library will negotiate its own features, so they
   need not be passed in as command line params. Note that as offloads are
   disabled this is the equivalent of setting:

   `csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off`

3b. If using an alternative character device, it must also be explicitly
   passed to QEMU using the `vhostfd` argument:

   ```
   -netdev tap,id=<id>,script=no,downscript=no,ifname=<name>,vhost=on,
   vhostfd=<open_fd>
   -device virtio-net-pci,netdev=net1,mac=<mac>
   ```

   The open file descriptor must be passed to QEMU running as a child
   process. This could be done with a simple python script.

   ```
   #!/usr/bin/python
   import os
   import subprocess

   fd = os.open("/dev/usvhost", os.O_RDWR)
   subprocess.call("qemu-system-x86_64 .... -netdev tap,id=vhostnet0,"
                   "vhost=on,vhostfd=" + str(fd) + " ...", shell=True)
   ```

   Alternatively the `qemu-wrap.py` script can be used to automate the
   requirements specified above and can be used in conjunction with libvirt
   if desired. See the "DPDK vhost-cuse VM configuration with QEMU wrapper"
   section below.

4. Configure huge pages:
   QEMU must allocate the VM's memory on hugetlbfs. Vhost ports access a
   virtio-net device's virtual rings and packet buffers mapping the VM's
   physical memory on hugetlbfs. To enable vhost-ports to map the VM's
   memory into their process address space, pass the following parameters
   to QEMU:

   `-object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,
   share=on -numa node,memdev=mem -mem-prealloc`

   Note: For use with an earlier QEMU version such as v1.6.2, use the
   following to configure hugepages instead:

   `-mem-path /dev/hugepages -mem-prealloc`

DPDK vhost-cuse VM configuration with QEMU wrapper:
---------------------------------------------------
The QEMU wrapper script automatically detects and calls QEMU with the
necessary parameters. It performs the following actions:

* Automatically detects the location of the hugetlbfs and inserts this
  into the command line parameters.
* Automatically opens file descriptors for each virtio-net device and
  inserts these into the command line parameters.
* Calls QEMU passing both the command line parameters passed to the
  script itself and those it has auto-detected.

Before use, you **must** edit the configuration parameters section of the
script to point to the correct emulator location and set additional
settings. Of these settings, `emul_path` and `us_vhost_path` **must** be
set. All other settings are optional.

To use directly from the command line simply pass the wrapper some of the
QEMU parameters: it will configure the rest. For example:

```
qemu-wrap.py -cpu host -boot c -hda <disk image> -m 4096 -smp 4 \
    --enable-kvm -nographic -vnc none -net none \
    -netdev tap,id=net1,script=no,downscript=no,ifname=if1,vhost=on \
    -device virtio-net-pci,netdev=net1,mac=00:00:00:00:00:01
```

DPDK vhost-cuse VM configuration with libvirt:
----------------------------------------------

If you are using libvirt, you must enable libvirt to access the character
device by adding it to the controllers cgroup for libvirtd using the
following steps.

1. In `/etc/libvirt/qemu.conf` add/edit the following lines:

   ```
   clear_emulator_capabilities = 0
   user = "root"
   group = "root"
   cgroup_device_acl = [
       "/dev/null", "/dev/full", "/dev/zero",
       "/dev/random", "/dev/urandom",
       "/dev/ptmx", "/dev/kvm", "/dev/kqemu",
       "/dev/rtc", "/dev/hpet", "/dev/net/tun",
       "/dev/<my-vhost-device>",
       "/dev/hugepages"]
   ```

   <my-vhost-device> refers to "vhost-net" if using the `/dev/vhost-net`
   device. If you have specified a different name on the ovs-vswitchd
   commandline using the "--cuse_dev_name" parameter, please specify that
   filename instead.

2. Disable SELinux or set it to permissive mode.

3. Restart the libvirtd process.
   For example, on Fedora:

   `systemctl restart libvirtd.service`

After successfully editing the configuration, you may launch your
vhost-enabled VM. The XML describing the VM can be configured like so
within the <qemu:commandline> section:

1. Set up shared hugepages:

   ```
   <qemu:arg value='-object'/>
   <qemu:arg value='memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on'/>
   <qemu:arg value='-numa'/>
   <qemu:arg value='node,memdev=mem'/>
   <qemu:arg value='-mem-prealloc'/>
   ```

2. Set up your tap devices:

   ```
   <qemu:arg value='-netdev'/>
   <qemu:arg value='type=tap,id=net1,script=no,downscript=no,ifname=vhost0,vhost=on'/>
   <qemu:arg value='-device'/>
   <qemu:arg value='virtio-net-pci,netdev=net1,mac=00:00:00:00:00:01'/>
   ```

   Repeat for as many devices as are desired, modifying the id, ifname
   and mac as necessary.

   Again, if you are using an alternative character device (other than
   `/dev/vhost-net`), please specify the file descriptor like so:

   `<qemu:arg value='type=tap,id=net3,script=no,downscript=no,ifname=vhost0,vhost=on,vhostfd=<open_fd>'/>`

   Where <open_fd> refers to the open file descriptor of the character
   device. Instructions on how to retrieve the file descriptor can be found
   in the "DPDK vhost-cuse VM configuration" section.
   Alternatively, the process is automated with the qemu-wrap.py script,
   detailed in the next section.

Now you may launch your VM using virt-manager, or like so:

`virsh create my_vhost_vm.xml`

DPDK vhost-cuse VM configuration with libvirt and QEMU wrapper:
---------------------------------------------------------------

To use the qemu-wrapper script in conjunction with libvirt, follow the
steps in the previous section before proceeding with the following steps:

1. Place `qemu-wrap.py` in libvirtd's binary search PATH ($PATH),
   ideally in the same directory in which the QEMU binary is located.

2. Ensure that the script has the same owner/group and file permissions
   as the QEMU binary.

3. Update the VM xml file using "virsh edit VM.xml"

   1. Set the VM to use the launch script.
      Set the emulator path contained in the `<emulator></emulator>` tags.
      For example, replace:

      `<emulator>/usr/bin/qemu-kvm</emulator>`

      with:

      `<emulator>/usr/bin/qemu-wrap.py</emulator>`

4. Edit the Configuration Parameters section of the script to point to
   the correct emulator location and set any additional options. If you are
   using an alternative character device name, please set "us_vhost_path" to
   the location of that device. The script will automatically detect and
   insert the correct "vhostfd" value in the QEMU command line arguments.

5. Use virt-manager to launch the VM.

Running ovs-vswitchd with DPDK backend inside a VM
--------------------------------------------------

Please note that additional configuration is required if you want to run
ovs-vswitchd with a DPDK backend inside a QEMU virtual machine. Ovs-vswitchd
creates separate DPDK TX queues for each CPU core available. This operation
fails inside a QEMU virtual machine because, by default, the VirtIO NIC
provided to the guest is configured to support only a single TX queue and a
single RX queue. To change this behavior, you need to turn on the 'mq'
(multiqueue) property of all virtio-net-pci devices emulated by QEMU and
used by DPDK. You may do it manually (by changing the QEMU command line) or,
if you use Libvirt, by adding the following string:

`<driver name='vhost' queues='N'/>`

to the <interface> sections of all network devices used by DPDK. Parameter
'N' determines how many queues can be used by the guest.

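When configuring QEMU manually rather than through Libvirt, the equivalent
(illustrative) command-line fragment enables multiqueue on both the tap
backend and the virtio device; 'N' is again the number of queues, and the
conventional vectors value is 2*N+2:

```
-netdev tap,id=net1,script=no,downscript=no,vhost=on,queues=N \
-device virtio-net-pci,netdev=net1,mac=00:00:00:00:00:01,mq=on,vectors=<2*N+2>
```
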
Restrictions:
-------------

- Works only with 1500 MTU; a few changes are needed in the DPDK library to
  fix this issue.
- Currently DPDK ports do not make use of any offload functionality.
- DPDK-vHost support works with 1G huge pages.

ivshmem:
- If you run Open vSwitch with smaller page sizes (e.g. 2MB), you may be
  unable to share any rings or mempools with a virtual machine.
  This is because the current implementation of ivshmem works by sharing
  a single 1GB huge page from the host operating system to any guest
  operating system through the Qemu ivshmem device. When using smaller
  page sizes, multiple pages may be required to hold the ring descriptors
  and buffer pools. The Qemu ivshmem device does not allow you to share
  multiple file descriptors to the guest operating system. However, if you
  want to share dpdkr rings with other processes on the host, you can do
  this with smaller page sizes.

Platform and Network Interface:
- Currently it is not possible to use an Intel XL710 Network Interface as a
  DPDK port type on a platform with more than 64 logical cores. This is
  related to how DPDK reports the number of TX queues that may be used by
  a DPDK application with an XL710. The maximum number of TX queues supported
  by a DPDK application for an XL710 is 64. If a user attempts to add an
  XL710 interface as a DPDK port type to a system as described above, the
  port addition will fail as OVS will attempt to initialize a TX queue
  greater than 64. This issue is expected to be resolved in a future DPDK
  release. As a workaround, a user can disable hyper-threading to reduce the
  overall core count of the system to be less than or equal to 64 when using
  an XL710 interface with DPDK.

vHost and QEMU v2.4.0+:
- For versions of QEMU v2.4.0 and later, it is currently not possible to
  unbind more than one dpdkvhostuser port from the guest kernel driver
  without causing the ovs-vswitchd process to crash. If this is a requirement
  for your use case, it is recommended either to use a version of QEMU
  between v2.2.0 and v2.3.1 (inclusive), or alternatively, to apply the
  following patch to DPDK and rebuild:
  http://dpdk.org/dev/patchwork/patch/7736/
  This problem will likely be resolved in Open vSwitch at a later date, when
  the next release of DPDK (which includes the above patch) is available and
  integrated into OVS.

Bug Reporting:
--------------

Please report problems to bugs@openvswitch.org.

[INSTALL.userspace.md]: INSTALL.userspace.md
[INSTALL.md]: INSTALL.md
[DPDK Linux GSG]: http://www.dpdk.org/doc/guides/linux_gsg/build_dpdk.html#binding-and-unbinding-network-ports-to-from-the-igb-uioor-vfio-modules
[DPDK Docs]: http://dpdk.org/doc