Using Open vSwitch with DPDK
============================
3 | ||
Open vSwitch can use the Intel(R) DPDK library to operate entirely in
userspace. This file explains how to install and use Open vSwitch in
such a mode.
7 | ||
8 | The DPDK support of Open vSwitch is considered experimental. | |
9 | It has not been thoroughly tested. | |
10 | ||
11 | This version of Open vSwitch should be built manually with `configure` | |
12 | and `make`. | |
13 | ||
14 | OVS needs a system with 1GB hugepages support. | |
15 | ||
16 | Building and Installing: | |
17 | ------------------------ | |
18 | ||
02ab4b1a | 19 | Required: DPDK 2.2 |
7d1ced01 CL |
20 | Optional (if building with vhost-cuse): `fuse`, `fuse-devel` (`libfuse-dev` |
21 | on Debian/Ubuntu) | |
542cc9bb TG |
22 | |
23 | 1. Configure build & install DPDK: | |
24 | 1. Set `$DPDK_DIR` | |
25 | ||
26 | ``` | |
02ab4b1a | 27 | export DPDK_DIR=/usr/src/dpdk-2.2 |
542cc9bb TG |
28 | cd $DPDK_DIR |
29 | ``` | |
30 | ||
2. Update `config/common_linuxapp` so that DPDK generates a single library
file (this modification is also required for the IVSHMEM build):
33 | ||
34 | `CONFIG_RTE_BUILD_COMBINE_LIBS=y` | |
35 | ||
777cb787 | 36 | Then run `make install` to build and install the library. |
542cc9bb TG |
37 | For default install without IVSHMEM: |
38 | ||
39 | `make install T=x86_64-native-linuxapp-gcc` | |
40 | ||
41 | To include IVSHMEM (shared memory): | |
42 | ||
43 | `make install T=x86_64-ivshmem-linuxapp-gcc` | |
44 | ||
45 | For further details refer to http://dpdk.org/ | |
46 | ||
47 | 2. Configure & build the Linux kernel: | |
48 | ||
Refer to intel-dpdk-getting-started-guide.pdf to understand the
DPDK kernel requirements.
51 | ||
52 | 3. Configure & build OVS: | |
53 | ||
54 | * Non IVSHMEM: | |
55 | ||
56 | `export DPDK_BUILD=$DPDK_DIR/x86_64-native-linuxapp-gcc/` | |
57 | ||
58 | * IVSHMEM: | |
59 | ||
60 | `export DPDK_BUILD=$DPDK_DIR/x86_64-ivshmem-linuxapp-gcc/` | |
61 | ||
```
cd $OVS_DIR
./boot.sh
./configure --with-dpdk=$DPDK_BUILD [CFLAGS="-g -O2 -Wno-cast-align"]
make
```
68 | ||
543342a4 MK |
69 | Note: 'clang' users may specify the '-Wno-cast-align' flag to suppress DPDK cast-align warnings. |
70 | ||
542cc9bb TG |
For better performance, you can enable aggressive compiler optimizations and
use special instructions (popcnt, crc32) that may not be available on all
machines. Instead of typing `make`, type:
74 | ||
75 | `make CFLAGS='-O3 -march=native'` | |
76 | ||
9feb1017 | 77 | Refer to [INSTALL.userspace.md] for general requirements of building userspace OVS. |
542cc9bb TG |
78 | |
79 | Using the DPDK with ovs-vswitchd: | |
80 | --------------------------------- | |
81 | ||
82 | 1. Setup system boot | |
83 | Add the following options to the kernel bootline: | |
84 | ||
85 | `default_hugepagesz=1GB hugepagesz=1G hugepages=1` | |
86 | ||
87 | 2. Setup DPDK devices: | |
491c2ea3 MG |
88 | |
DPDK devices can be set up using either the VFIO (for DPDK 1.7+) or UIO
modules. UIO requires inserting an out-of-tree driver, igb_uio.ko, that is
available in DPDK. Setup for both methods is described below.
92 | ||
93 | * UIO: | |
94 | 1. insert uio.ko: `modprobe uio` | |
95 | 2. insert igb_uio.ko: `insmod $DPDK_BUILD/kmod/igb_uio.ko` | |
96 | 3. Bind network device to igb_uio: | |
dbde55e7 | 97 | `$DPDK_DIR/tools/dpdk_nic_bind.py --bind=igb_uio eth1` |
491c2ea3 MG |
98 | |
99 | * VFIO: | |
100 | ||
101 | VFIO needs to be supported in the kernel and the BIOS. More information | |
102 | can be found in the [DPDK Linux GSG]. | |
103 | ||
104 | 1. Insert vfio-pci.ko: `modprobe vfio-pci` | |
105 | 2. Set correct permissions on vfio device: `sudo /usr/bin/chmod a+x /dev/vfio` | |
106 | and: `sudo /usr/bin/chmod 0666 /dev/vfio/*` | |
107 | 3. Bind network device to vfio-pci: | |
dbde55e7 | 108 | `$DPDK_DIR/tools/dpdk_nic_bind.py --bind=vfio-pci eth1` |
542cc9bb | 109 | |
3. Mount the hugetlbfs filesystem
542cc9bb TG |
111 | |
112 | `mount -t hugetlbfs -o pagesize=1G none /dev/hugepages` | |
113 | ||
Refer to http://www.dpdk.org/doc/quick-start to verify the DPDK setup.
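
As a quick sanity check after booting with the hugepage options and mounting hugetlbfs, the hugepage counters in `/proc/meminfo` can be summarized. The following is a minimal sketch; the helper name `hugepage_summary` is hypothetical and not part of OVS or DPDK.

```sh
# hugepage_summary: read /proc/meminfo-style text on stdin and print the
# total and free hugepage counts plus the default hugepage size.
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free = $2 }
        /^Hugepagesize:/    { size = $2 " " $3 }
        END { printf "total=%s free=%s size=%s\n", total, free, size }
    '
}
```

Run it as `hugepage_summary < /proc/meminfo`. With the boot options above, the reported size should be `1048576 kB` and the total should be at least 1.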
115 | ||
a52b0492 GS |
116 | 4. Follow the instructions in [INSTALL.md] to install only the |
117 | userspace daemons and utilities (via 'make install'). | |
542cc9bb TG |
118 | 1. First time only db creation (or clearing): |
119 | ||
a52b0492 GS |
120 | ``` |
121 | mkdir -p /usr/local/etc/openvswitch | |
122 | mkdir -p /usr/local/var/run/openvswitch | |
123 | rm /usr/local/etc/openvswitch/conf.db | |
124 | ovsdb-tool create /usr/local/etc/openvswitch/conf.db \ | |
125 | /usr/local/share/openvswitch/vswitch.ovsschema | |
126 | ``` | |
542cc9bb | 127 | |
a52b0492 | 128 | 2. Start ovsdb-server |
542cc9bb | 129 | |
a52b0492 GS |
```
ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock \
    --remote=db:Open_vSwitch,Open_vSwitch,manager_options \
    --private-key=db:Open_vSwitch,SSL,private_key \
    --certificate=db:Open_vSwitch,SSL,certificate \
    --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --pidfile --detach
```
542cc9bb TG |
137 | |
138 | 3. First time after db creation, initialize: | |
139 | ||
a52b0492 GS |
140 | ``` |
141 | ovs-vsctl --no-wait init | |
142 | ``` | |
542cc9bb TG |
143 | |
144 | 5. Start vswitchd: | |
145 | ||
DPDK configuration arguments can be passed to vswitchd via the `--dpdk`
argument, which must be the first argument passed to the vswitchd process.
The DPDK `-c` argument is ignored by ovs-dpdk, but it is a required
parameter for DPDK initialization.
150 | ||
a52b0492 GS |
151 | ``` |
152 | export DB_SOCK=/usr/local/var/run/openvswitch/db.sock | |
153 | ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach | |
154 | ``` | |
542cc9bb | 155 | |
a52b0492 GS |
If more than one 1 GB hugepage is allocated (as for IVSHMEM), set the amount
of memory to take from each NUMA node. The example below takes 1024 MB from
NUMA node 0:
542cc9bb | 158 | |
a52b0492 GS |
159 | ``` |
160 | ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 \ | |
161 | -- unix:$DB_SOCK --pidfile --detach | |
162 | ``` | |
542cc9bb TG |
163 | |
164 | 6. Add bridge & ports | |
b8e57534 | 165 | |
542cc9bb TG |
166 | To use ovs-vswitchd with DPDK, create a bridge with datapath_type |
167 | "netdev" in the configuration database. For example: | |
168 | ||
a52b0492 | 169 | `ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev` |
542cc9bb | 170 | |
f748d99a RB |
171 | Now you can add dpdk devices. OVS expects DPDK device names to start with |
172 | "dpdk" and end with a portid. vswitchd should print (in the log file) the | |
173 | number of dpdk devices found. | |
542cc9bb | 174 | |
a52b0492 GS |
175 | ``` |
176 | ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk | |
177 | ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk | |
178 | ``` | |
542cc9bb | 179 | |
a52b0492 GS |
Once the first DPDK port is added, vswitchd creates a polling thread that
polls the DPDK device in a continuous loop. CPU utilization for that thread
is therefore always 100%.
542cc9bb | 183 | |
77c180ce BM |
184 | Note: creating bonds of DPDK interfaces is slightly different to creating |
185 | bonds of system interfaces. For DPDK, the interface type must be explicitly | |
186 | set, for example: | |
187 | ||
188 | ``` | |
189 | ovs-vsctl add-bond br0 dpdkbond dpdk0 dpdk1 -- set Interface dpdk0 type=dpdk -- set Interface dpdk1 type=dpdk | |
190 | ``` | |
191 | ||
542cc9bb TG |
192 | 7. Add test flows |
193 | ||
Test flow script across NICs (assuming OVS is in /usr/src/ovs).
Execute the following script:
196 | ||
197 | ``` | |
198 | #! /bin/sh | |
199 | # Move to command directory | |
200 | cd /usr/src/ovs/utilities/ | |
201 | ||
202 | # Clear current flows | |
203 | ./ovs-ofctl del-flows br0 | |
204 | ||
205 | # Add flows between port 1 (dpdk0) to port 2 (dpdk1) | |
206 | ./ovs-ofctl add-flow br0 in_port=1,action=output:2 | |
207 | ./ovs-ofctl add-flow br0 in_port=2,action=output:1 | |
208 | ``` | |
209 | ||
188d29d7 KT |
210 | Performance Tuning: |
211 | ------------------- | |
542cc9bb | 212 | |
188d29d7 | 213 | 1. PMD affinitization |
542cc9bb | 214 | |
188d29d7 KT |
A poll mode driver (pmd) thread handles the I/O of all DPDK
interfaces assigned to it. A pmd thread busy-loops through the
assigned ports/rxqs polling for packets, switches the packets
and sends them to a tx port if required. A pmd thread is typically
CPU bound, meaning that the greater the CPU occupancy the pmd
thread can get, the better the performance. To that end, it is
good practice to ensure that a pmd thread has as many cycles on a
core available to it as possible. This can be achieved by
affinitizing the pmd thread with a core that has no other
workload. See section 7 below for a description of how to isolate
cores for this purpose.
542cc9bb | 226 | |
188d29d7 KT |
227 | The following command can be used to specify the affinity of the |
228 | pmd thread(s). | |
542cc9bb | 229 | |
188d29d7 | 230 | `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=<hex string>` |
542cc9bb | 231 | |
188d29d7 KT |
232 | By setting a bit in the mask, a pmd thread is created and pinned |
233 | to the corresponding CPU core. e.g. to run a pmd thread on core 1 | |
542cc9bb | 234 | |
188d29d7 | 235 | `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=2` |
542cc9bb | 236 | |
188d29d7 | 237 | For more information, please refer to the Open_vSwitch TABLE section in |
542cc9bb | 238 | |
188d29d7 | 239 | `man ovs-vswitchd.conf.db` |
542cc9bb | 240 | |
188d29d7 KT |
Note that a pmd thread on a NUMA node is only created if at least one
DPDK interface from that NUMA node has been added to OVS.
542cc9bb | 243 | |
188d29d7 | 244 | 2. Multiple poll mode driver threads |
542cc9bb | 245 | |
188d29d7 KT |
246 | With pmd multi-threading support, OVS creates one pmd thread |
247 | for each NUMA node by default. However, it can be seen that in cases | |
248 | where there are multiple ports/rxq's producing traffic, performance | |
249 | can be improved by creating multiple pmd threads running on separate | |
250 | cores. These pmd threads can then share the workload by each being | |
251 | responsible for different ports/rxq's. Assignment of ports/rxq's to | |
252 | pmd threads is done automatically. | |
542cc9bb | 253 | |
188d29d7 KT |
254 | The following command can be used to specify the affinity of the |
255 | pmd threads. | |
542cc9bb | 256 | |
188d29d7 KT |
257 | `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=<hex string>` |
258 | ||
259 | A set bit in the mask means a pmd thread is created and pinned | |
260 | to the corresponding CPU core. e.g. to run pmd threads on core 1 and 2 | |
261 | ||
262 | `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6` | |
263 | ||
264 | For more information, please refer to the Open_vSwitch TABLE section in | |
265 | ||
266 | `man ovs-vswitchd.conf.db` | |
267 | ||
268 | For example, when using dpdk and dpdkvhostuser ports in a bi-directional | |
269 | VM loopback as shown below, spreading the workload over 2 or 4 pmd | |
270 | threads shows significant improvements as there will be more total CPU | |
271 | occupancy available. | |
272 | ||
273 | NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1 | |
274 | ||
ce179f11 IM |
275 | The following command can be used to confirm that the port/rxq assignment |
276 | to pmd threads is as required: | |
277 | ||
278 | `ovs-appctl dpif-netdev/pmd-rxq-show` | |
279 | ||
280 | This can also be checked with: | |
188d29d7 KT |
281 | |
282 | ``` | |
283 | top -H | |
284 | taskset -p <pid_of_pmd> | |
285 | ``` | |
286 | ||
287 | To understand where most of the pmd thread time is spent and whether the | |
288 | caches are being utilized, these commands can be used: | |
289 | ||
290 | ``` | |
291 | # Clear previous stats | |
292 | ovs-appctl dpif-netdev/pmd-stats-clear | |
293 | ||
294 | # Check current stats | |
295 | ovs-appctl dpif-netdev/pmd-stats-show | |
296 | ``` | |
297 | ||
298 | 3. DPDK port Rx Queues | |
299 | ||
a14b8947 | 300 | `ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>` |
188d29d7 | 301 | |
a14b8947 | 302 | The command above sets the number of rx queues for DPDK interface. |
188d29d7 KT |
303 | The rx queues are assigned to pmd threads on the same NUMA node in a |
304 | round-robin fashion. For more information, please refer to the | |
305 | Open_vSwitch TABLE section in | |
306 | ||
307 | `man ovs-vswitchd.conf.db` | |
308 | ||
309 | 4. Exact Match Cache | |
310 | ||
311 | Each pmd thread contains one EMC. After initial flow setup in the | |
312 | datapath, the EMC contains a single table and provides the lowest level | |
313 | (fastest) switching for DPDK ports. If there is a miss in the EMC then | |
314 | the next level where switching will occur is the datapath classifier. | |
315 | Missing in the EMC and looking up in the datapath classifier incurs a | |
316 | significant performance penalty. If lookup misses occur in the EMC | |
317 | because it is too small to handle the number of flows, its size can | |
318 | be increased. The EMC size can be modified by editing the define | |
319 | EM_FLOW_HASH_SHIFT in lib/dpif-netdev.c. | |
320 | ||
321 | As mentioned above an EMC is per pmd thread. So an alternative way of | |
322 | increasing the aggregate amount of possible flow entries in EMC and | |
323 | avoiding datapath classifier lookups is to have multiple pmd threads | |
324 | running. This can be done as described in section 2. | |
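
To get a feel for how the shift value scales the cache, the arithmetic can be sketched as below. This assumes the EMC holds `1 << EM_FLOW_HASH_SHIFT` entries per pmd thread, as the name of the define suggests; check `lib/dpif-netdev.c` in your tree before relying on it. The helper `emc_entries` and the shift values shown are illustrative only.

```sh
# emc_entries SHIFT: number of EMC entries for a given EM_FLOW_HASH_SHIFT,
# assuming the table size is 2^SHIFT (an assumption, not taken from the source).
emc_entries() {
    echo $(( 1 << $1 ))
}

# Illustrative shift values; none of them is claimed to be the default.
for s in 10 13 16; do
    echo "EM_FLOW_HASH_SHIFT=$s -> $(emc_entries "$s") entries per pmd thread"
done
```

Each increment of the shift doubles the number of entries, and with it the memory used by every pmd thread.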
325 | ||
326 | 5. Compiler options | |
327 | ||
328 | The default compiler optimization level is '-O2'. Changing this to | |
329 | more aggressive compiler optimizations such as '-O3' or | |
330 | '-Ofast -march=native' with gcc can produce performance gains. | |
331 | ||
332 | 6. Simultaneous Multithreading (SMT) | |
333 | ||
334 | With SMT enabled, one physical core appears as two logical cores | |
335 | which can improve performance. | |
336 | ||
337 | SMT can be utilized to add additional pmd threads without consuming | |
338 | additional physical cores. Additional pmd threads may be added in the | |
339 | same manner as described in section 2. If trying to minimize the use | |
340 | of physical cores for pmd threads, care must be taken to set the | |
341 | correct bits in the pmd-cpu-mask to ensure that the pmd threads are | |
342 | pinned to SMT siblings. | |
343 | ||
344 | For example, when using 2x 10 core processors in a dual socket system | |
345 | with HT enabled, /proc/cpuinfo will report 40 logical cores. To use | |
346 | two logical cores which share the same physical core for pmd threads, | |
347 | the following command can be used to identify a pair of logical cores. | |
348 | ||
349 | `cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list` | |
350 | ||
where N is the logical core number. In this example, it would show that
cores 1 and 21 share the same physical core. The pmd-cpu-mask to enable
two pmd threads running on these two logical cores (one physical core)
is:

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=200002`
357 | ||
Note that SMT is enabled by the Hyper-Threading section in the
BIOS, and as such will apply to the whole system. So the impact of
enabling/disabling it for the whole system should be considered,
e.g. if workloads on the system can scale across multiple cores,
SMT may be very beneficial. However, if they do not and perform best
on a single physical core, SMT may not be beneficial.
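
Because the pmd-cpu-mask is a hex bitmap with bit N set for logical core N, it is easy to get wrong by hand. The following sketch derives the mask from a core list in the same format that `thread_siblings_list` uses (comma-separated cores or ranges). The function name `cores_to_mask` is hypothetical and not part of OVS.

```sh
# cores_to_mask LIST: print the hex pmd-cpu-mask for a list of logical cores.
# LIST uses the thread_siblings_list format, e.g. "1,21" or "2-3".
cores_to_mask() {
    mask=0
    # Split the list on commas; each part is a single core or a range.
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        first=${part%-*}
        last=${part#*-}
        core=$first
        while [ "$core" -le "$last" ]; do
            mask=$(( mask | (1 << core) ))
            core=$(( core + 1 ))
        done
    done
    printf '%x\n' "$mask"
}

cores_to_mask 1      # core 1 only: prints 2
cores_to_mask 1,2    # cores 1 and 2: prints 6
cores_to_mask 1,21   # SMT siblings 1 and 21: prints 200002
```

The result can be passed straight to OVS, for example `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=$(cores_to_mask 1,21)`.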
364 | ||
365 | 7. The isolcpus kernel boot parameter | |
366 | ||
367 | isolcpus can be used on the kernel bootline to isolate cores from the | |
368 | kernel scheduler and hence dedicate them to OVS or other packet | |
369 | forwarding related workloads. For example a Linux kernel boot-line | |
370 | could be: | |
371 | ||
`GRUB_CMDLINE_LINUX_DEFAULT="quiet hugepagesz=1G hugepages=4 default_hugepagesz=1G intel_iommu=off isolcpus=1-19"`
373 | ||
374 | 8. NUMA/Cluster On Die | |
375 | ||
376 | Ideally inter NUMA datapaths should be avoided where possible as packets | |
377 | will go across QPI and there may be a slight performance penalty when | |
378 | compared with intra NUMA datapaths. On Intel Xeon Processor E5 v3, | |
379 | Cluster On Die is introduced on models that have 10 cores or more. | |
380 | This makes it possible to logically split a socket into two NUMA regions | |
381 | and again it is preferred where possible to keep critical datapaths | |
382 | within the one cluster. | |
383 | ||
384 | It is good practice to ensure that threads that are in the datapath are | |
385 | pinned to cores in the same NUMA area. e.g. pmd threads and QEMU vCPUs | |
386 | responsible for forwarding. | |
387 | ||
388 | 9. Rx Mergeable buffers | |
389 | ||
390 | Rx Mergeable buffers is a virtio feature that allows chaining of multiple | |
391 | virtio descriptors to handle large packet sizes. As such, large packets | |
392 | are handled by reserving and chaining multiple free descriptors | |
393 | together. Mergeable buffer support is negotiated between the virtio | |
394 | driver and virtio device and is supported by the DPDK vhost library. | |
395 | This behavior is typically supported and enabled by default, however | |
396 | in the case where the user knows that rx mergeable buffers are not needed | |
397 | i.e. jumbo frames are not needed, it can be forced off by adding | |
de658847 | 398 | mrg_rxbuf=off to the QEMU command line options. By not reserving multiple |
188d29d7 KT |
399 | chains of descriptors it will make more individual virtio descriptors |
400 | available for rx to the guest using dpdkvhost ports and this can improve | |
401 | performance. | |
402 | ||
403 | 10. Packet processing in the guest | |
404 | ||
405 | It is good practice whether simply forwarding packets from one | |
406 | interface to another or more complex packet processing in the guest, | |
407 | to ensure that the thread performing this work has as much CPU | |
408 | occupancy as possible. For example when the DPDK sample application | |
409 | `testpmd` is used to forward packets in the guest, multiple QEMU vCPU | |
410 | threads can be created. Taskset can then be used to affinitize the | |
411 | vCPU thread responsible for forwarding to a dedicated core not used | |
412 | for other general processing on the host system. | |
413 | ||
414 | 11. DPDK virtio pmd in the guest | |
415 | ||
dpdkvhostcuse or dpdkvhostuser ports can be used to accelerate the path
to the guest using the DPDK vhost library. This library is compatible with
virtio-net drivers in the guest, but significantly better performance can
be observed when using the DPDK virtio pmd driver in the guest. The DPDK
`testpmd` application can be used in the guest as an example application
that forwards packets from one DPDK vhost port to another. An example of
running `testpmd` in the guest is shown below:
423 | ||
424 | `./testpmd -c 0x3 -n 4 --socket-mem 512 -- --burst=64 -i --txqflags=0xf00 --disable-hw-vlan --forward-mode=io --auto-start` | |
425 | ||
See below for information on dpdkvhostcuse and dpdkvhostuser ports.
427 | See [DPDK Docs] for more information on `testpmd`. | |
542cc9bb | 428 | |
6553d06b | 429 | |
6553d06b | 430 | |
542cc9bb TG |
431 | DPDK Rings : |
432 | ------------ | |
433 | ||
434 | Following the steps above to create a bridge, you can now add dpdk rings | |
435 | as a port to the vswitch. OVS will expect the DPDK ring device name to | |
436 | start with dpdkr and end with a portid. | |
437 | ||
a52b0492 | 438 | `ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr` |
542cc9bb TG |
439 | |
440 | DPDK rings client test application | |
441 | ||
442 | Included in the test directory is a sample DPDK application for testing | |
443 | the rings. This is from the base dpdk directory and modified to work | |
444 | with the ring naming used within ovs. | |
445 | ||
446 | location tests/ovs_client | |
447 | ||
448 | To run the client : | |
449 | ||
a52b0492 GS |
450 | ``` |
451 | cd /usr/src/ovs/tests/ | |
452 | ovsclient -c 1 -n 4 --proc-type=secondary -- -n "port id you gave dpdkr" | |
453 | ``` | |
542cc9bb TG |
454 | |
455 | In the case of the dpdkr example above the "port id you gave dpdkr" is 0. | |
456 | ||
457 | It is essential to have --proc-type=secondary | |
458 | ||
459 | The application simply receives an mbuf on the receive queue of the | |
460 | ethernet ring and then places that same mbuf on the transmit ring of | |
461 | the ethernet ring. It is a trivial loopback application. | |
462 | ||
463 | DPDK rings in VM (IVSHMEM shared memory communications) | |
464 | ------------------------------------------------------- | |
465 | ||
466 | In addition to executing the client in the host, you can execute it within | |
467 | a guest VM. To do so you will need a patched qemu. You can download the | |
468 | patch and getting started guide at : | |
469 | ||
470 | https://01.org/packet-processing/downloads | |
471 | ||
472 | A general rule of thumb for better performance is that the client | |
473 | application should not be assigned the same dpdk core mask "-c" as | |
474 | the vswitchd. | |
475 | ||
58397e6c KT |
476 | DPDK vhost: |
477 | ----------- | |
478 | ||
02ab4b1a | 479 | DPDK 2.2 supports two types of vhost: |
58397e6c | 480 | |
7d1ced01 CL |
481 | 1. vhost-user |
482 | 2. vhost-cuse | |
58397e6c | 483 | |
7d1ced01 CL |
The type of vhost enabled in the specified DPDK build is the type
that will be enabled in OVS. By default, vhost-user is enabled in DPDK.
Therefore, unless vhost-cuse has been enabled in DPDK, vhost-user ports
will be enabled in OVS.
Please note that support for vhost-cuse is intended to be deprecated in OVS
in a future release.
58397e6c | 490 | |
7d1ced01 CL |
491 | DPDK vhost-user: |
492 | ---------------- | |
58397e6c | 493 | |
7d1ced01 CL |
494 | The following sections describe the use of vhost-user 'dpdkvhostuser' ports |
495 | with OVS. | |
58397e6c | 496 | |
7d1ced01 CL |
497 | DPDK vhost-user Prerequisites: |
498 | ------------------------- | |
58397e6c | 499 | |
1. DPDK 2.2 with vhost support enabled as documented in the "Building and
Installing" section
58397e6c | 502 | |
7d1ced01 | 503 | 2. QEMU version v2.1.0+ |
58397e6c | 504 | |
7d1ced01 CL |
505 | QEMU v2.1.0 will suffice, but it is recommended to use v2.2.0 if providing |
506 | your VM with memory greater than 1GB due to potential issues with memory | |
507 | mapping larger areas. | |
58397e6c | 508 | |
7d1ced01 CL |
509 | Adding DPDK vhost-user ports to the Switch: |
510 | -------------------------------------- | |
58397e6c | 511 | |
7d1ced01 CL |
512 | Following the steps above to create a bridge, you can now add DPDK vhost-user |
513 | as a port to the vswitch. Unlike DPDK ring ports, DPDK vhost-user ports can | |
1af27e8a DDP |
514 | have arbitrary names, except that forward and backward slashes are prohibited |
515 | in the names. | |
58397e6c | 516 | |
7d1ced01 | 517 | - For vhost-user, the name of the port type is `dpdkvhostuser` |
58397e6c | 518 | |
```
ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1 \
    type=dpdkvhostuser
```
523 | ||
This action creates a socket located at
`/usr/local/var/run/openvswitch/vhost-user-1`, which you must provide
to your VM on the QEMU command line. More instructions on this can be
found in the next section, "DPDK vhost-user VM configuration".

Note: If you wish for the vhost-user sockets to be created in a
directory other than `/usr/local/var/run/openvswitch`, you may specify
another location on the ovs-vswitchd command line like so:
531 | ||
532 | `./vswitchd/ovs-vswitchd --dpdk -vhost_sock_dir /my-dir -c 0x1 ...` | |
533 | ||
534 | DPDK vhost-user VM configuration: | |
535 | --------------------------------- | |
536 | Follow the steps below to attach vhost-user port(s) to a VM. | |
537 | ||
538 | 1. Configure sockets. | |
539 | Pass the following parameters to QEMU to attach a vhost-user device: | |
540 | ||
541 | ``` | |
542 | -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 | |
543 | -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce | |
544 | -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1 | |
545 | ``` | |
546 | ||
...where vhost-user-1 is the name of the vhost-user port added
to the switch.
Repeat the above parameters for multiple devices, changing the
chardev path and id as necessary. Note that a separate and different
chardev path needs to be specified for each vhost-user device. For
example, if you have a second vhost-user port named 'vhost-user-2',
append an additional set of parameters to your QEMU command line:
554 | ||
555 | ``` | |
556 | -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2 | |
557 | -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce | |
558 | -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2 | |
559 | ``` | |
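
When several vhost-user ports are attached, the per-port parameters differ only in the index. The following is a minimal sketch that generates them, assuming ports are named `vhost-user-N`, sockets live in the default directory, and MAC addresses follow the `00:00:00:00:00:NN` pattern used in the examples above. The function name `vhost_user_args` and the `SOCK_DIR` variable are hypothetical.

```sh
# Directory where OVS creates the vhost-user sockets (default location).
SOCK_DIR=/usr/local/var/run/openvswitch

# vhost_user_args N: print the QEMU options for vhost-user port vhost-user-N,
# one option per line.
vhost_user_args() {
    n=$1
    # The last MAC byte is the port index, printed as two hex digits.
    mac=$(printf '00:00:00:00:00:%02x' "$n")
    printf '%s\n' "-chardev socket,id=char$n,path=$SOCK_DIR/vhost-user-$n"
    printf '%s\n' "-netdev type=vhost-user,id=mynet$n,chardev=char$n,vhostforce"
    printf '%s\n' "-device virtio-net-pci,mac=$mac,netdev=mynet$n"
}

vhost_user_args 1
vhost_user_args 2
```

The output for index 2 matches the second set of parameters shown above; adjust the naming and MAC scheme to your own setup.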
560 | ||
561 | 2. Configure huge pages. | |
QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports access
a virtio-net device's virtual rings and packet buffers by mapping the VM's
physical memory on hugetlbfs. To enable vhost-user ports to map the VM's
memory into their process address space, pass the following parameters
to QEMU:
567 | ||
```
-object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on
-numa node,memdev=mem -mem-prealloc
```
573 | ||
4573fbd3 | 574 | 3. Optional: Enable multiqueue support |
a14b8947 IM |
The vhost-user interface must be configured in Open vSwitch with the
desired number of queues with:
577 | ||
578 | ``` | |
579 | ovs-vsctl set Interface vhost-user-2 options:n_rxq=<requested queues> | |
580 | ``` | |
581 | ||
QEMU needs to be configured as well.
The value of $q below should match the number of queues requested in OVS
(if $q is larger, packets will not be received).
$v is the number of vectors, which is '$q x 2 + 2'.
586 | ||
587 | ``` | |
588 | -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2 | |
589 | -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q | |
590 | -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v | |
591 | ``` | |
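
The vector count follows directly from the queue count. As a small worked example (the helper name `vhost_vectors` is hypothetical):

```sh
# vhost_vectors Q: number of vectors for Q queues, using the rule Q x 2 + 2.
vhost_vectors() {
    echo $(( $1 * 2 + 2 ))
}

q=4
v=$(vhost_vectors "$q")
echo "queues=$q vectors=$v"   # prints: queues=4 vectors=10
```

With `n_rxq=4` in OVS, the QEMU options would therefore use `queues=4` and `vectors=10`.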
592 | ||
7d1ced01 CL |
593 | DPDK vhost-cuse: |
594 | ---------------- | |
595 | ||
596 | The following sections describe the use of vhost-cuse 'dpdkvhostcuse' ports | |
597 | with OVS. | |
598 | ||
599 | DPDK vhost-cuse Prerequisites: | |
600 | ------------------------- | |
601 | ||
02ab4b1a | 602 | 1. DPDK 2.2 with vhost support enabled as documented in the "Building and |
7d1ced01 CL |
603 | Installing section" |
604 | As an additional step, you must enable vhost-cuse in DPDK by setting the | |
605 | following additional flag in `config/common_linuxapp`: | |
606 | ||
607 | `CONFIG_RTE_LIBRTE_VHOST_USER=n` | |
608 | ||
609 | Following this, rebuild DPDK as per the instructions in the "Building and | |
610 | Installing" section. Finally, rebuild OVS as per step 3 in the "Building | |
611 | and Installing" section - OVS will detect that DPDK has vhost-cuse libraries | |
612 | compiled and in turn will enable support for it in the switch and disable | |
613 | vhost-user support. | |
614 | ||
615 | 2. Insert the Cuse module: | |
616 | ||
617 | `modprobe cuse` | |
618 | ||
619 | 3. Build and insert the `eventfd_link` module: | |
620 | ||
621 | ``` | |
622 | cd $DPDK_DIR/lib/librte_vhost/eventfd_link/ | |
623 | make | |
624 | insmod $DPDK_DIR/lib/librte_vhost/eventfd_link.ko | |
625 | ``` | |
626 | ||
627 | 4. QEMU version v2.1.0+ | |
628 | ||
629 | vhost-cuse will work with QEMU v2.1.0 and above, however it is recommended to | |
630 | use v2.2.0 if providing your VM with memory greater than 1GB due to potential | |
631 | issues with memory mapping larger areas. | |
632 | Note: QEMU v1.6.2 will also work, with slightly different command line parameters, | |
633 | which are specified later in this document. | |
634 | ||
635 | Adding DPDK vhost-cuse ports to the Switch: | |
636 | -------------------------------------- | |
637 | ||
638 | Following the steps above to create a bridge, you can now add DPDK vhost-cuse | |
639 | as a port to the vswitch. Unlike DPDK ring ports, DPDK vhost-cuse ports can have | |
640 | arbitrary names. | |
641 | ||
642 | - For vhost-cuse, the name of the port type is `dpdkvhostcuse` | |
643 | ||
```
ovs-vsctl add-port br0 vhost-cuse-1 -- set Interface vhost-cuse-1 \
    type=dpdkvhostcuse
```
648 | ||
649 | When attaching vhost-cuse ports to QEMU, the name provided during the | |
650 | add-port operation must match the ifname parameter on the QEMU command | |
651 | line. More instructions on this can be found in the next section. | |
652 | ||
653 | DPDK vhost-cuse VM configuration: | |
654 | --------------------------------- | |
655 | ||
656 | vhost-cuse ports use a Linux* character device to communicate with QEMU. | |
58397e6c KT |
657 | By default it is set to `/dev/vhost-net`. It is possible to reuse this |
658 | standard device for DPDK vhost, which makes setup a little simpler but it | |
659 | is better practice to specify an alternative character device in order to | |
660 | avoid any conflicts if kernel vhost is to be used in parallel. | |
661 | ||
662 | 1. This step is only needed if using an alternative character device. | |
663 | ||
664 | The new character device filename must be specified on the vswitchd | |
665 | commandline: | |
666 | ||
667 | `./vswitchd/ovs-vswitchd --dpdk --cuse_dev_name my-vhost-net -c 0x1 ...` | |
668 | ||
669 | Note that the `--cuse_dev_name` argument and associated string must be the first | |
670 | arguments after `--dpdk` and come before the EAL arguments. In the example | |
671 | above, the character device to be used will be `/dev/my-vhost-net`. | |
672 | ||
673 | 2. This step is only needed if reusing the standard character device. It will | |
674 | conflict with the kernel vhost character device so the user must first | |
675 | remove it. | |
676 | ||
677 | `rm -rf /dev/vhost-net` | |
678 | ||
679 | 3a. Configure virtio-net adaptors: | |
680 | The following parameters must be passed to the QEMU binary: | |
681 | ||
```
-netdev tap,id=<id>,script=no,downscript=no,ifname=<name>,vhost=on
-device virtio-net-pci,netdev=<id>,mac=<mac>
```
686 | ||
687 | Repeat the above parameters for multiple devices. | |
688 | ||
The DPDK vhost library will negotiate its own features, so they
need not be passed in as command line params. Note that as offloads are
disabled this is the equivalent of setting:
692 | ||
693 | `csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off` | |
694 | ||
3b. If using an alternative character device, it must also be explicitly
passed to QEMU using the `vhostfd` argument:
697 | ||
698 | ``` | |
699 | -netdev tap,id=<id>,script=no,downscript=no,ifname=<name>,vhost=on, | |
700 | vhostfd=<open_fd> | |
701 | -device virtio-net-pci,netdev=<id>,mac=<mac> | |
702 | ``` | |
703 | ||
704 | The open file descriptor must be passed to QEMU running as a child | |
705 | process. This could be done with a simple Python script: | |
706 | |
707 | ``` | |
708 | #!/usr/bin/python | |
709 | import os, subprocess | |
710 | fd = os.open("/dev/usvhost", os.O_RDWR) | |
711 | subprocess.call("qemu-system-x86_64 .... -netdev tap,id=vhostnet0,vhost=on,vhostfd=" + str(fd) + " ...", shell=True) | |
712 | ``` | |
898dcef1 | 713 | Alternatively the `qemu-wrap.py` script can be used to automate the |
58397e6c KT |
714 | requirements specified above and can be used in conjunction with libvirt if |
715 | desired. See the "DPDK vhost-cuse VM configuration with QEMU wrapper" section | |
716 | below. | |
717 | ||
718 | 4. Configure huge pages: | |
719 | QEMU must allocate the VM's memory on hugetlbfs. Vhost ports access a | |
720 | virtio-net device's virtual rings and packet buffers by mapping the VM's | |
721 | physical memory on hugetlbfs. To enable vhost-ports to map the VM's | |
7d1ced01 | 722 | memory into their process address space, pass the following parameters |
58397e6c KT |
723 | to QEMU: |
724 | ||
725 | `-object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages, | |
726 | share=on -numa node,memdev=mem -mem-prealloc` | |
727 | ||
7d1ced01 CL |
728 | Note: For use with an earlier QEMU version such as v1.6.2, use the |
729 | following to configure hugepages instead: | |
58397e6c | 730 | |
7d1ced01 | 731 | `-mem-path /dev/hugepages -mem-prealloc` |
58397e6c | 732 | |
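Step 3b above relies on the QEMU child process inheriting the open descriptor. The following is a minimal Python 3 sketch of that idea, not the `qemu-wrap.py` implementation. The helper names `vhost_netdev_args` and `launch_qemu` are hypothetical, and the device path, QEMU binary and remaining QEMU arguments are left to the caller. On Python 3, descriptors are not inherited by child processes by default, so the sketch passes the descriptor through `pass_fds`.

```python
import os
import subprocess


def vhost_netdev_args(fd, netdev_id, ifname, mac):
    """Build the -netdev/-device pair from step 3b for one virtio-net device."""
    netdev = ("tap,id={0},script=no,downscript=no,ifname={1},"
              "vhost=on,vhostfd={2}").format(netdev_id, ifname, fd)
    device = "virtio-net-pci,netdev={0},mac={1}".format(netdev_id, mac)
    return ["-netdev", netdev, "-device", device]


def launch_qemu(cuse_path, qemu_binary, base_args, netdev_id, ifname, mac):
    """Open the character device and run QEMU as a child that inherits it."""
    fd = os.open(cuse_path, os.O_RDWR)
    try:
        cmd = ([qemu_binary] + list(base_args) +
               vhost_netdev_args(fd, netdev_id, ifname, mac))
        # pass_fds keeps fd open in the child under the same number, so the
        # vhostfd=<fd> value on the command line stays valid.
        return subprocess.call(cmd, pass_fds=(fd,))
    finally:
        os.close(fd)
```

Passing the command as a list avoids shell quoting, and building the `-netdev`/`-device` pair in one place keeps the `id` and `netdev` values in agreement.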
7d1ced01 CL |
733 | DPDK vhost-cuse VM configuration with QEMU wrapper: |
734 | --------------------------------------------------- | |
58397e6c KT |
735 | The QEMU wrapper script automatically detects and calls QEMU with the |
736 | necessary parameters. It performs the following actions: | |
737 | ||
738 | * Automatically detects the location of the hugetlbfs and inserts this | |
739 | into the command line parameters. | |
740 | * Automatically opens file descriptors for each virtio-net device and | |
741 | inserts them into the command line parameters. | |
742 | * Calls QEMU passing both the command line parameters passed to the | |
743 | script itself and those it has auto-detected. | |
744 | ||
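As an illustration of the first action in the list above, one way to locate a hugetlbfs mount is to scan the contents of `/proc/mounts`. This is a sketch under that assumption and not necessarily how `qemu-wrap.py` does it; the function name `find_hugetlbfs` is hypothetical.

```python
def find_hugetlbfs(mounts_text):
    """Return the mount point of the first hugetlbfs entry, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, filesystem type, options...
        if len(fields) >= 3 and fields[2] == "hugetlbfs":
            return fields[1]
    return None
```

A caller could read the file with `open("/proc/mounts").read()` and use the returned path as the `mem-path` value passed to QEMU.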
745 | Before use, you **must** edit the configuration parameters section of the | |
746 | script to point to the correct emulator location and set additional | |
747 | settings. Of these settings, `emul_path` and `us_vhost_path` **must** be | |
748 | set. All other settings are optional. | |
749 | ||
750 | To use directly from the command line simply pass the wrapper some of the | |
751 | QEMU parameters: it will configure the rest. For example: | |
752 | ||
753 | ``` | |
754 | qemu-wrap.py -cpu host -boot c -hda <disk image> -m 4096 -smp 4 | |
755 | --enable-kvm -nographic -vnc none -net none -netdev tap,id=net1, | |
756 | script=no,downscript=no,ifname=if1,vhost=on -device virtio-net-pci, | |
757 | netdev=net1,mac=00:00:00:00:00:01 | |
5568661c | 758 | ``` |
58397e6c | 759 | |
7d1ced01 CL |
760 | DPDK vhost-cuse VM configuration with libvirt: |
761 | ---------------------------------------------- | |
58397e6c KT |
762 | |
763 | If you are using libvirt, you must enable libvirt to access the character | |
764 | device by adding it to the controllers cgroup for libvirtd using the following | |
765 | steps. | |
766 | ||
767 | 1. In `/etc/libvirt/qemu.conf` add/edit the following lines: | |
768 | ||
769 | ``` | |
770 | clear_emulator_capabilities = 0 | |
771 | user = "root" | |
772 | group = "root" | |
773 | cgroup_device_acl = [ | |
774 | "/dev/null", "/dev/full", "/dev/zero", | |
775 | "/dev/random", "/dev/urandom", | |
776 | "/dev/ptmx", "/dev/kvm", "/dev/kqemu", | |
777 | "/dev/rtc", "/dev/hpet", "/dev/net/tun", | |
778 | "/dev/<my-vhost-device>", | |
779 | "/dev/hugepages"] | |
780 | ``` | |
781 | ||
782 | <my-vhost-device> refers to "vhost-net" if using the `/dev/vhost-net` | |
783 | device. If you have specified a different name on the ovs-vswitchd | |
784 | command line using the "--cuse_dev_name" parameter, please specify that | |
785 | filename instead. | |
786 | ||
787 | 2. Disable SELinux or set to permissive mode | |
788 | ||
789 | 3. Restart the libvirtd process | |
790 | For example, on Fedora: | |
791 | ||
792 | `systemctl restart libvirtd.service` | |
793 | ||
794 | After successfully editing the configuration, you may launch your | |
795 | vhost-enabled VM. The XML describing the VM can be configured like so | |
796 | within the <qemu:commandline> section: | |
797 | ||
798 | 1. Set up shared hugepages: | |
799 | ||
800 | ``` | |
801 | <qemu:arg value='-object'/> | |
802 | <qemu:arg value='memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on'/> | |
803 | <qemu:arg value='-numa'/> | |
804 | <qemu:arg value='node,memdev=mem'/> | |
805 | <qemu:arg value='-mem-prealloc'/> | |
806 | ``` | |
807 | ||
808 | 2. Set up your tap devices: | |
809 | ||
810 | ``` | |
811 | <qemu:arg value='-netdev'/> | |
812 | <qemu:arg value='type=tap,id=net1,script=no,downscript=no,ifname=vhost0,vhost=on'/> | |
813 | <qemu:arg value='-device'/> | |
814 | <qemu:arg value='virtio-net-pci,netdev=net1,mac=00:00:00:00:00:01'/> | |
815 | ``` | |
816 | ||
817 | Repeat for as many devices as are desired, modifying the id, ifname | |
818 | and mac as necessary. | |
819 | ||
820 | Again, if you are using an alternative character device (other than | |
821 | `/dev/vhost-net`), please specify the file descriptor like so: | |
822 | ||
823 | `<qemu:arg value='type=tap,id=net3,script=no,downscript=no,ifname=vhost0,vhost=on,vhostfd=<open_fd>'/>` | |
824 | ||
825 | Where <open_fd> refers to the open file descriptor of the character device. | |
826 | Instructions on how to retrieve the file descriptor can be found in the | |
827 | "DPDK vhost VM configuration" section. | |
828 | Alternatively, the process is automated with the qemu-wrap.py script, | |
829 | detailed in the next section. | |
830 | ||
831 | Now you may launch your VM using virt-manager, or like so: | |
832 | ||
833 | `virsh create my_vhost_vm.xml` | |
834 | ||
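Because the tap device arguments in this section have to be repeated per device with a different id, ifname and mac, they can be generated rather than written by hand. The sketch below is one way to do that; the function name `qemu_arg_lines` is hypothetical, and the output still has to be placed inside the `<qemu:commandline>` section manually.

```python
from xml.sax.saxutils import quoteattr


def qemu_arg_lines(devices):
    """Return <qemu:arg> lines for a list of (id, ifname, mac) tuples."""
    lines = []
    for netdev_id, ifname, mac in devices:
        netdev = ("type=tap,id={0},script=no,downscript=no,"
                  "ifname={1},vhost=on").format(netdev_id, ifname)
        device = "virtio-net-pci,netdev={0},mac={1}".format(netdev_id, mac)
        for value in ("-netdev", netdev, "-device", device):
            # quoteattr escapes the value and wraps it in quotes.
            lines.append("<qemu:arg value={0}/>".format(quoteattr(value)))
    return "\n".join(lines)
```

For example, `qemu_arg_lines([("net1", "vhost0", "00:00:00:00:00:01")])` yields four `<qemu:arg>` lines equivalent to the tap device block shown in step 2, using double rather than single quotes around the attribute values.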
7d1ced01 | 835 | DPDK vhost-cuse VM configuration with libvirt and QEMU wrapper: |
58397e6c KT |
836 | --------------------------------------------------------------- | |
837 | ||
838 | To use the qemu-wrapper script in conjunction with libvirt, follow the | |
839 | steps in the previous section before proceeding with the following steps: | |
840 | ||
841 | 1. Place `qemu-wrap.py` in libvirtd's binary search PATH ($PATH), | |
842 | ideally in the same directory as the QEMU binary. | |
843 | ||
844 | 2. Ensure that the script has the same owner/group and file permissions | |
845 | as the QEMU binary. | |
846 | ||
847 | 3. Update the VM XML configuration using `virsh edit <domain>` | |
848 | ||
849 | 1. Set the VM to use the launch script. | |
850 | Set the emulator path contained in the `<emulator></emulator>` tags. | |
851 | For example, replace: | |
852 | |
853 | `<emulator>/usr/bin/qemu-kvm</emulator>` | |
854 | |
855 | with: | |
856 | |
857 | `<emulator>/usr/bin/qemu-wrap.py</emulator>` | |
858 | ||
859 | 4. Edit the Configuration Parameters section of the script to point to | |
860 | the correct emulator location and set any additional options. If you are | |
861 | using an alternative character device name, please set "us_vhost_path" to the | |
862 | location of that device. The script will automatically detect and insert | |
7d1ced01 | 863 | the correct "vhostfd" value in the QEMU command line arguments. |
58397e6c KT |
864 | |
865 | 5. Use virt-manager to launch the VM | |
866 | ||
9899125a OS |
867 | Running ovs-vswitchd with DPDK backend inside a VM |
868 | -------------------------------------------------- | |
869 | ||
870 | Please note that additional configuration is required if you want to run | |
871 | ovs-vswitchd with the DPDK backend inside a QEMU virtual machine. ovs-vswitchd | |
872 | creates a separate DPDK TX queue for each available CPU core. This operation | |
873 | fails inside a QEMU virtual machine because, by default, the VirtIO NIC provided | |
874 | to the guest is configured to support only a single TX queue and a single RX | |
875 | queue. To change this behavior, you need to turn on the 'mq' (multiqueue) | |
876 | property of all virtio-net-pci devices emulated by QEMU and used by DPDK. | |
877 | You may do it manually (by changing the QEMU command line) or, if you use libvirt, | |
878 | by adding the following string: | |
879 | ||
880 | `<driver name='vhost' queues='N'/>` | |
881 | ||
882 | to <interface> sections of all network devices used by DPDK. Parameter 'N' | |
883 | determines how many queues can be used by the guest. | |
884 | ||
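For libvirt users, the `<driver>` change described above could be applied to a domain definition with a short script. This is a sketch only: the function name `enable_multiqueue` is hypothetical, and it updates every `<interface>` element in the document, so interfaces not used by DPDK would need to be filtered out first.

```python
import xml.etree.ElementTree as ET


def enable_multiqueue(domain_xml, queues):
    """Set <driver name='vhost' queues='N'/> on each <interface> element."""
    root = ET.fromstring(domain_xml)
    for iface in root.iter("interface"):
        driver = iface.find("driver")
        if driver is None:
            # No driver element yet: create one under this interface.
            driver = ET.SubElement(iface, "driver")
        driver.set("name", "vhost")
        driver.set("queues", str(queues))
    return ET.tostring(root, encoding="unicode")
```

The returned string is the modified domain XML; how it is fed back to libvirt is left to the caller.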
542cc9bb TG |
885 | Restrictions: |
886 | ------------- | |
887 | ||
542cc9bb TG |
888 | - Works only with an MTU of 1500; a few changes in the DPDK library are needed to fix this issue. | |
889 | - Currently, DPDK ports do not make use of any offload functionality. | |
58397e6c | 890 | - DPDK-vHost support works with 1G huge pages. |
542cc9bb TG |
891 | |
892 | ivshmem: | |
3088fab7 MG |
893 | - If you run Open vSwitch with smaller page sizes (e.g. 2MB), you may be |
894 | unable to share any rings or mempools with a virtual machine. | |
895 | This is because the current implementation of ivshmem works by sharing | |
896 | a single 1GB huge page from the host operating system to any guest | |
897 | operating system through the Qemu ivshmem device. When using smaller | |
898 | page sizes, multiple pages may be required to hold the ring descriptors | |
899 | and buffer pools. The Qemu ivshmem device does not allow you to share | |
900 | multiple file descriptors to the guest operating system. However, if you | |
901 | want to share dpdkr rings with other processes on the host, you can do | |
902 | this with smaller page sizes. | |
542cc9bb | 903 | |
1e77bbe5 | 904 | Platform and Network Interface: |
49bbbdfd IS |
905 | - By default with DPDK 2.2, a maximum of 64 TX queues can be used with an |
906 | Intel XL710 Network Interface on a platform with more than 64 logical | |
907 | cores. If a user attempts to add an XL710 interface as a DPDK port type to | |
908 | a system as described above, an error will be reported that initialization | |
909 | failed for the 65th queue. OVS will then roll back to the previous | |
910 | successful queue initialization and use that value as the total number of | |
911 | TX queues available with queue locking. If a user wishes to use more than | |
912 | 64 queues and avoid locking, then the | |
913 | `CONFIG_RTE_LIBRTE_I40E_QUEUE_NUM_PER_PF` config parameter in DPDK must be | |
914 | increased to the desired number of queues. Both DPDK and OVS must be | |
915 | recompiled for this change to take effect. | |
1e77bbe5 | 916 | |
e73b7508 CL |
917 | vHost and QEMU v2.4.0+: |
918 | - For versions of QEMU v2.4.0 and later, it is currently not possible to | |
919 | unbind more than one dpdkvhostuser port from the guest kernel driver | |
920 | without causing the ovs-vswitchd process to crash. If this is a requirement | |
921 | for your use case, it is recommended either to use a version of QEMU | |
922 | between v2.2.0 and v2.3.1 (inclusive), or alternatively, to apply the | |
923 | following patch to DPDK and rebuild: | |
924 | http://dpdk.org/dev/patchwork/patch/7736/ | |
925 | This problem will likely be resolved in Open vSwitch at a later date, when | |
926 | the next release of DPDK (which includes the above patch) is available and | |
927 | integrated into OVS. | |
928 | ||
542cc9bb TG |
929 | Bug Reporting: |
930 | -------------- | |
931 | ||
932 | Please report problems to bugs@openvswitch.org. | |
9feb1017 TG |
933 | |
934 | [INSTALL.userspace.md]:INSTALL.userspace.md | |
935 | [INSTALL.md]:INSTALL.md | |
491c2ea3 | 936 | [DPDK Linux GSG]: http://www.dpdk.org/doc/guides/linux_gsg/build_dpdk.html#binding-and-unbinding-network-ports-to-from-the-igb-uioor-vfio-modules |
58397e6c | 937 | [DPDK Docs]: http://dpdk.org/doc |