Using Open vSwitch with DPDK
============================

Open vSwitch can use Intel(R) DPDK lib to operate entirely in
userspace. This file explains how to install and use Open vSwitch in
such a mode.

The DPDK support of Open vSwitch is considered experimental.
It has not been thoroughly tested.

This version of Open vSwitch should be built manually with `configure`
and `make`.

OVS needs a system with 1GB hugepages support.

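As a quick sanity check (not part of the original setup steps), the CPU must
advertise the `pdpe1gb` flag for 1GB pages to be usable:

```
grep -q pdpe1gb /proc/cpuinfo && echo "1GB hugepages supported by CPU" \
    || echo "1GB hugepages NOT supported by CPU"
```
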
Building and Installing:
------------------------

Required: DPDK 16.04
Optional (if building with vhost-cuse): `fuse`, `fuse-devel` (`libfuse-dev`
on Debian/Ubuntu)

1. Configure build & install DPDK:
  1. Set `$DPDK_DIR`

     ```
     export DPDK_DIR=/usr/src/dpdk-16.04
     cd $DPDK_DIR
     ```

  2. Then run `make install` to build and install the library.
     For default install without IVSHMEM:

     `make install T=x86_64-native-linuxapp-gcc DESTDIR=install`

     To include IVSHMEM (shared memory):

     `make install T=x86_64-ivshmem-linuxapp-gcc DESTDIR=install`

     For further details refer to http://dpdk.org/

2. Configure & build the Linux kernel:

   Refer to intel-dpdk-getting-started-guide.pdf to understand the
   DPDK kernel requirements.

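   A rough way to confirm the relevant kernel options are enabled, assuming
   the running kernel's config is exposed at /boot/config-$(uname -r) (this
   check is illustrative and not taken from the DPDK guide):

   ```
   grep -E 'CONFIG_HUGETLBFS|CONFIG_UIO|CONFIG_VFIO' /boot/config-$(uname -r)
   ```
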
3. Configure & build OVS:

   * Non IVSHMEM:

     `export DPDK_BUILD=$DPDK_DIR/x86_64-native-linuxapp-gcc/`

   * IVSHMEM:

     `export DPDK_BUILD=$DPDK_DIR/x86_64-ivshmem-linuxapp-gcc/`

   ```
   cd $OVS_DIR
   ./boot.sh
   ./configure --with-dpdk=$DPDK_BUILD [CFLAGS="-g -O2 -Wno-cast-align"]
   make
   ```

   Note: 'clang' users may specify the '-Wno-cast-align' flag to suppress DPDK cast-align warnings.

   For better performance, one can enable aggressive compiler optimizations and
   use special instructions (popcnt, crc32) that may not be available on all
   machines. Instead of typing `make`, type:

   `make CFLAGS='-O3 -march=native'`

   Refer to [INSTALL.userspace.md] for general requirements of building userspace OVS.

Using the DPDK with ovs-vswitchd:
---------------------------------

1. Setup system boot
   Add the following options to the kernel bootline:

   `default_hugepagesz=1GB hugepagesz=1G hugepages=1`

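   After rebooting, the reserved hugepages can be verified (an optional
   check, not part of the original steps):

   ```
   grep Huge /proc/meminfo
   ```
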
2. Setup DPDK devices:

   DPDK devices can be setup using either the VFIO (for DPDK 1.7+) or UIO
   modules. UIO requires inserting an out of tree driver igb_uio.ko that is
   available in DPDK. Setup for both methods is described below.

   * UIO:
     1. insert uio.ko: `modprobe uio`
     2. insert igb_uio.ko: `insmod $DPDK_BUILD/kmod/igb_uio.ko`
     3. Bind network device to igb_uio:
        `$DPDK_DIR/tools/dpdk_nic_bind.py --bind=igb_uio eth1`

   * VFIO:

     VFIO needs to be supported in the kernel and the BIOS. More information
     can be found in the [DPDK Linux GSG].

     1. Insert vfio-pci.ko: `modprobe vfio-pci`
     2. Set correct permissions on vfio device: `sudo /usr/bin/chmod a+x /dev/vfio`
        and: `sudo /usr/bin/chmod 0666 /dev/vfio/*`
     3. Bind network device to vfio-pci:
        `$DPDK_DIR/tools/dpdk_nic_bind.py --bind=vfio-pci eth1`

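   With either method, the binding can be verified with the status option of
   the same script:

   ```
   $DPDK_DIR/tools/dpdk_nic_bind.py --status
   ```
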
3. Mount the hugetlbfs filesystem

   `mount -t hugetlbfs -o pagesize=1G none /dev/hugepages`

   Refer to http://www.dpdk.org/doc/quick-start for verifying DPDK setup.

4. Follow the instructions in [INSTALL.md] to install only the
   userspace daemons and utilities (via 'make install').
   1. First time only: create (or clear) the database:

      ```
      mkdir -p /usr/local/etc/openvswitch
      mkdir -p /usr/local/var/run/openvswitch
      rm /usr/local/etc/openvswitch/conf.db
      ovsdb-tool create /usr/local/etc/openvswitch/conf.db  \
          /usr/local/share/openvswitch/vswitch.ovsschema
      ```

   2. Start ovsdb-server

      ```
      ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock \
          --remote=db:Open_vSwitch,Open_vSwitch,manager_options \
          --private-key=db:Open_vSwitch,SSL,private_key \
          --certificate=db:Open_vSwitch,SSL,certificate \
          --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --pidfile --detach
      ```

   3. First time after db creation, initialize:

      ```
      ovs-vsctl --no-wait init
      ```

5. Start vswitchd:

   DPDK configuration arguments can be passed to vswitchd via the Open_vSwitch
   other_config column. The recognized configuration options are listed below.
   Defaults will be provided for all values not explicitly set.

   * dpdk-init
     Specifies whether OVS should initialize and support DPDK ports. This is
     a boolean, and defaults to false.

   * dpdk-lcore-mask
     Specifies the CPU cores on which dpdk lcore threads should be spawned.
     The DPDK lcore threads are used for DPDK library tasks, such as
     library internal message processing, logging, etc. The value should be in
     the form of a hex string (e.g. '0x123'), similar to the 'taskset' mask
     input.
     If not specified, the value will be determined by choosing the lowest
     CPU core from the initial cpu affinity list. Otherwise, the value will be
     passed directly to the DPDK library.
     For performance reasons, it is best to set this to a single core on
     the system, rather than allow lcore threads to float.

   * dpdk-alloc-mem
     This sets the total memory to preallocate from hugepages regardless of
     processor socket. It is recommended to use dpdk-socket-mem instead.

   * dpdk-socket-mem
     Comma separated list of memory to pre-allocate from hugepages on specific
     sockets.

   * dpdk-hugepage-dir
     Directory where hugetlbfs is mounted.

   * dpdk-extra
     Extra arguments to provide to DPDK EAL, as previously specified on the
     command line. Do not pass '--no-huge' to the system in this way; support
     for running the system without hugepages is nonexistent.

   * cuse-dev-name
     Option to set the vhost_cuse character device name.

   * vhost-sock-dir
     Option to set the path to the vhost_user unix socket files.

   NOTE: Changing any of these options requires restarting the ovs-vswitchd
   application.

   Open vSwitch can be started as normal. DPDK will be initialized as long
   as the dpdk-init option has been set to 'true'.

   ```
   export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
   ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
   ovs-vswitchd unix:$DB_SOCK --pidfile --detach
   ```

   If more than one GB of hugepages has been allocated (as for IVSHMEM), set
   the amount of memory to use from NUMA node 0:

   ```
   ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,0"
   ovs-vswitchd unix:$DB_SOCK --pidfile --detach
   ```

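   As an illustrative combined example (the values here are only placeholders;
   choose a core mask and per-socket memory split suitable for your system),
   the options above can be set together before starting vswitchd:

   ```
   ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
   ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x2
   ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,0"
   ovs-vswitchd unix:$DB_SOCK --pidfile --detach
   ```
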
6. Add bridge & ports

   To use ovs-vswitchd with DPDK, create a bridge with datapath_type
   "netdev" in the configuration database. For example:

   `ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev`

   Now you can add dpdk devices. OVS expects DPDK device names to start with
   "dpdk" and end with a portid. vswitchd should print (in the log file) the
   number of dpdk devices found.

   ```
   ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
   ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk
   ```

   Once the first DPDK port is added to vswitchd, it creates a polling thread
   and polls the dpdk device in a continuous loop. Therefore CPU utilization
   for that thread is always 100%.

   Note: creating bonds of DPDK interfaces is slightly different from creating
   bonds of system interfaces. For DPDK, the interface type must be explicitly
   set, for example:

   ```
   ovs-vsctl add-bond br0 dpdkbond dpdk0 dpdk1 -- set Interface dpdk0 type=dpdk -- set Interface dpdk1 type=dpdk
   ```

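   To check the resulting bond afterwards (assuming the bond name used above):

   ```
   ovs-appctl bond/show dpdkbond
   ```
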
7. Add test flows

   Test flow script across NICs (assuming ovs in /usr/src/ovs):
   Execute script:

   ```
   #! /bin/sh
   # Move to command directory
   cd /usr/src/ovs/utilities/

   # Clear current flows
   ./ovs-ofctl del-flows br0

   # Add flows between port 1 (dpdk0) to port 2 (dpdk1)
   ./ovs-ofctl add-flow br0 in_port=1,action=output:2
   ./ovs-ofctl add-flow br0 in_port=2,action=output:1
   ```

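   To confirm the flows were installed, the flow table can be dumped
   (an optional check):

   ```
   ./ovs-ofctl dump-flows br0
   ```
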
8. QoS usage example

   Assuming you have a vhost-user port transmitting traffic consisting of
   packets of size 64 bytes, the following command would limit the egress
   transmission rate of the port to ~1,000,000 packets per second (the CIR of
   46000000 bytes/sec corresponds to roughly 46 bytes of counted data per
   64-byte packet at that rate):

   `ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos create qos
   type=egress-policer other-config:cir=46000000 other-config:cbs=2048`

   To examine the QoS configuration of the port:

   `ovs-appctl -t ovs-vswitchd qos/show vhost-user0`

   To clear the QoS configuration from the port and ovsdb use the following:

   `ovs-vsctl destroy QoS vhost-user0 -- clear Port vhost-user0 qos`

   For more details regarding egress-policer parameters please refer to the
   vswitch.xml.

9. Ingress Policing Example

   Assuming you have a vhost-user port receiving traffic consisting of
   packets of size 64 bytes, the following command would limit the reception
   rate of the port to ~1,000,000 packets per second (368000 kbps corresponds
   to roughly 368 bits, i.e. 46 bytes, of counted data per packet at that
   rate):

   `ovs-vsctl set interface vhost-user0 ingress_policing_rate=368000
   ingress_policing_burst=1000`

   To examine the ingress policer configuration of the port:

   `ovs-vsctl list interface vhost-user0`

   To clear the ingress policer configuration from the port use the following:

   `ovs-vsctl set interface vhost-user0 ingress_policing_rate=0`

   For more details regarding ingress-policer see the vswitch.xml.

Performance Tuning:
-------------------

1. PMD affinitization

   A poll mode driver (pmd) thread handles the I/O of all DPDK
   interfaces assigned to it. A pmd thread will busy loop through
   the assigned port/rxq's polling for packets, switch the packets
   and send to a tx port if required. Typically, it is found that
   a pmd thread is CPU bound, meaning that the greater the CPU
   occupancy the pmd thread can get, the better the performance. To
   that end, it is good practice to ensure that a pmd thread has as
   many cycles on a core available to it as possible. This can be
   achieved by affinitizing the pmd thread with a core that has no
   other workload. See section 7 below for a description of how to
   isolate cores for this purpose.

   The following command can be used to specify the affinity of the
   pmd thread(s).

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=<hex string>`

   By setting a bit in the mask, a pmd thread is created and pinned
   to the corresponding CPU core. e.g. to run a pmd thread on core 1:

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=2`

   For more information, please refer to the Open_vSwitch TABLE section in

   `man ovs-vswitchd.conf.db`

   Note that a pmd thread on a NUMA node is only created if there is
   at least one DPDK interface from that NUMA node added to OVS.

2. Multiple poll mode driver threads

   With pmd multi-threading support, OVS creates one pmd thread
   for each NUMA node by default. However, in cases where there are
   multiple ports/rxq's producing traffic, performance can be improved
   by creating multiple pmd threads running on separate cores. These
   pmd threads can then share the workload by each being
   responsible for different ports/rxq's. Assignment of ports/rxq's to
   pmd threads is done automatically.

   The following command can be used to specify the affinity of the
   pmd threads.

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=<hex string>`

   A set bit in the mask means a pmd thread is created and pinned
   to the corresponding CPU core. e.g. to run pmd threads on cores 1 and 2:

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6`

   For more information, please refer to the Open_vSwitch TABLE section in

   `man ovs-vswitchd.conf.db`

   For example, when using dpdk and dpdkvhostuser ports in a bi-directional
   VM loopback as shown below, spreading the workload over 2 or 4 pmd
   threads shows significant improvements as there will be more total CPU
   occupancy available.

   NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1

   The following command can be used to confirm that the port/rxq assignment
   to pmd threads is as required:

   `ovs-appctl dpif-netdev/pmd-rxq-show`

   This can also be checked with:

   ```
   top -H
   taskset -p <pid_of_pmd>
   ```

   To understand where most of the pmd thread time is spent and whether the
   caches are being utilized, these commands can be used:

   ```
   # Clear previous stats
   ovs-appctl dpif-netdev/pmd-stats-clear

   # Check current stats
   ovs-appctl dpif-netdev/pmd-stats-show
   ```

3. DPDK port Rx Queues

   `ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>`

   The command above sets the number of rx queues for the DPDK interface.
   The rx queues are assigned to pmd threads on the same NUMA node in a
   round-robin fashion. For more information, please refer to the
   Open_vSwitch TABLE section in

   `man ovs-vswitchd.conf.db`

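   For example, assuming a DPDK port named dpdk0 (as added earlier) and four
   desired rx queues:

   ```
   ovs-vsctl set Interface dpdk0 options:n_rxq=4
   ```
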
4. Exact Match Cache

   Each pmd thread contains one EMC. After initial flow setup in the
   datapath, the EMC contains a single table and provides the lowest level
   (fastest) switching for DPDK ports. If there is a miss in the EMC then
   the next level where switching will occur is the datapath classifier.
   Missing in the EMC and looking up in the datapath classifier incurs a
   significant performance penalty. If lookup misses occur in the EMC
   because it is too small to handle the number of flows, its size can
   be increased. The EMC size can be modified by editing the define
   EM_FLOW_HASH_SHIFT in lib/dpif-netdev.c.

   As mentioned above, an EMC is per pmd thread. So an alternative way of
   increasing the aggregate number of possible flow entries in the EMC and
   avoiding datapath classifier lookups is to have multiple pmd threads
   running. This can be done as described in section 2.

5. Compiler options

   The default compiler optimization level is '-O2'. Changing this to
   more aggressive compiler optimizations such as '-O3' or
   '-Ofast -march=native' with gcc can produce performance gains.

6. Simultaneous Multithreading (SMT)

   With SMT enabled, one physical core appears as two logical cores
   which can improve performance.

   SMT can be utilized to add additional pmd threads without consuming
   additional physical cores. Additional pmd threads may be added in the
   same manner as described in section 2. If trying to minimize the use
   of physical cores for pmd threads, care must be taken to set the
   correct bits in the pmd-cpu-mask to ensure that the pmd threads are
   pinned to SMT siblings.

   For example, when using 2x 10 core processors in a dual socket system
   with HT enabled, /proc/cpuinfo will report 40 logical cores. To use
   two logical cores which share the same physical core for pmd threads,
   the following command can be used to identify a pair of logical cores.

   `cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list`

   where N is the logical core number. In this example, it would show that
   cores 1 and 21 share the same physical core. The pmd-cpu-mask to enable
   two pmd threads running on these two logical cores (one physical core)
   is:

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=200002`

   Note that SMT is enabled by the Hyper-Threading section in the
   BIOS, and as such will apply to the whole system. So the impact of
   enabling/disabling it for the whole system should be considered,
   e.g. if workloads on the system can scale across multiple cores,
   SMT may be very beneficial. However, if they do not and perform best
   on a single physical core, SMT may not be beneficial.

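   As a rough illustration (not taken from the original document), the mask
   for a core and its SMT sibling can be computed in shell, assuming the
   sibling list is a comma separated pair such as "1,21":

   ```
   SIBLINGS=$(cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list)
   MASK=0
   for c in $(echo "$SIBLINGS" | tr ',' ' '); do
       MASK=$((MASK | (1 << c)))
   done
   printf 'pmd-cpu-mask=%x\n' "$MASK"    # prints 200002 for siblings 1 and 21
   ```
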
7. The isolcpus kernel boot parameter

   isolcpus can be used on the kernel bootline to isolate cores from the
   kernel scheduler and hence dedicate them to OVS or other packet
   forwarding related workloads. For example a Linux kernel boot-line
   could be:

   `GRUB_CMDLINE_LINUX_DEFAULT="quiet hugepagesz=1G hugepages=4 default_hugepagesz=1G intel_iommu=off isolcpus=1-19"`

8. NUMA/Cluster On Die

   Ideally inter NUMA datapaths should be avoided where possible as packets
   will go across QPI and there may be a slight performance penalty when
   compared with intra NUMA datapaths. On Intel Xeon Processor E5 v3,
   Cluster On Die is introduced on models that have 10 cores or more.
   This makes it possible to logically split a socket into two NUMA regions
   and again it is preferred where possible to keep critical datapaths
   within the one cluster.

   It is good practice to ensure that threads that are in the datapath are
   pinned to cores in the same NUMA area, e.g. pmd threads and QEMU vCPUs
   responsible for forwarding.

9. Rx Mergeable buffers

   Rx mergeable buffers is a virtio feature that allows chaining of multiple
   virtio descriptors to handle large packet sizes. As such, large packets
   are handled by reserving and chaining multiple free descriptors
   together. Mergeable buffer support is negotiated between the virtio
   driver and virtio device and is supported by the DPDK vhost library.
   This behavior is typically supported and enabled by default. However,
   in the case where the user knows that rx mergeable buffers are not needed,
   i.e. jumbo frames are not needed, it can be forced off by adding
   mrg_rxbuf=off to the QEMU command line options. By not reserving multiple
   chains of descriptors, more individual virtio descriptors are available
   for rx to the guest using dpdkvhost ports, and this can improve
   performance.

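   For example, re-using the virtio-net-pci device line from the vhost-user
   section below, mergeable buffers could be disabled like so (illustrative
   only):

   ```
   -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=off
   ```
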
10. Packet processing in the guest

    It is good practice, whether simply forwarding packets from one
    interface to another or doing more complex packet processing in the
    guest, to ensure that the thread performing this work has as much CPU
    occupancy as possible. For example when the DPDK sample application
    `testpmd` is used to forward packets in the guest, multiple QEMU vCPU
    threads can be created. Taskset can then be used to affinitize the
    vCPU thread responsible for forwarding to a dedicated core not used
    for other general processing on the host system.

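    For instance, with a hypothetical vCPU thread PID of 1234, it could be
    pinned to host core 5 with:

    ```
    taskset -pc 5 1234
    ```
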
11. DPDK virtio pmd in the guest

    dpdkvhostcuse or dpdkvhostuser ports can be used to accelerate the path
    to the guest using the DPDK vhost library. This library is compatible with
    virtio-net drivers in the guest but significantly better performance can
    be observed when using the DPDK virtio pmd driver in the guest. The DPDK
    `testpmd` application can be used in the guest as an example application
    that forwards packets from one DPDK vhost port to another. An example of
    running `testpmd` in the guest is shown below.

    `./testpmd -c 0x3 -n 4 --socket-mem 512 -- --burst=64 -i --txqflags=0xf00 --disable-hw-vlan --forward-mode=io --auto-start`

    See below for information on dpdkvhostcuse and dpdkvhostuser ports.
    See [DPDK Docs] for more information on `testpmd`.

DPDK Rings:
-----------

Following the steps above to create a bridge, you can now add dpdk rings
as a port to the vswitch. OVS will expect the DPDK ring device name to
start with dpdkr and end with a portid.

`ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr`

DPDK rings client test application

Included in the test directory is a sample DPDK application for testing
the rings. This is from the base dpdk directory and modified to work
with the ring naming used within ovs.

location tests/ovs_client

To run the client:

```
cd /usr/src/ovs/tests/
ovsclient -c 1 -n 4 --proc-type=secondary -- -n "port id you gave dpdkr"
```

In the case of the dpdkr example above, the "port id you gave dpdkr" is 0.

It is essential to have --proc-type=secondary.

The application simply receives an mbuf on the receive queue of the
ethernet ring and then places that same mbuf on the transmit ring of
the ethernet ring. It is a trivial loopback application.

DPDK rings in VM (IVSHMEM shared memory communications)
--------------------------------------------------------

In addition to executing the client in the host, you can execute it within
a guest VM. To do so you will need a patched qemu. You can download the
patch and a getting started guide at:

https://01.org/packet-processing/downloads

A general rule of thumb for better performance is that the client
application should not be assigned the same dpdk core mask "-c" as
the vswitchd.

DPDK vhost:
-----------

DPDK 16.04 supports two types of vhost:

1. vhost-user
2. vhost-cuse

Whatever type of vhost is enabled in the DPDK build specified is the type
that will be enabled in OVS. By default, vhost-user is enabled in DPDK.
Therefore, unless vhost-cuse has been enabled in DPDK, vhost-user ports
will be enabled in OVS.
Please note that support for vhost-cuse is intended to be deprecated in OVS
in a future release.

DPDK vhost-user:
----------------

The following sections describe the use of vhost-user 'dpdkvhostuser' ports
with OVS.

DPDK vhost-user Prerequisites:
------------------------------

1. DPDK 16.04 with vhost support enabled as documented in the "Building and
   Installing" section.

2. QEMU version v2.1.0+

   QEMU v2.1.0 will suffice, but it is recommended to use v2.2.0 if providing
   your VM with memory greater than 1GB due to potential issues with memory
   mapping larger areas.

Adding DPDK vhost-user ports to the Switch:
-------------------------------------------

Following the steps above to create a bridge, you can now add DPDK vhost-user
as a port to the vswitch. Unlike DPDK ring ports, DPDK vhost-user ports can
have arbitrary names, except that forward and backward slashes are prohibited
in the names.

- For vhost-user, the name of the port type is `dpdkvhostuser`

  ```
  ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1
  type=dpdkvhostuser
  ```

  This action creates a socket located at
  `/usr/local/var/run/openvswitch/vhost-user-1`, which you must provide
  to your VM on the QEMU command line. More instructions on this can be
  found in the next section "DPDK vhost-user VM configuration".
- If you wish for the vhost-user sockets to be created in a sub-directory of
  `/usr/local/var/run/openvswitch`, you may specify this directory in the
  ovsdb like so:

  `./utilities/ovs-vsctl --no-wait \
    set Open_vSwitch . other_config:vhost-sock-dir=subdir`

DPDK vhost-user VM configuration:
---------------------------------
Follow the steps below to attach vhost-user port(s) to a VM.

1. Configure sockets.
   Pass the following parameters to QEMU to attach a vhost-user device:

   ```
   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1
   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
   -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1
   ```

   ...where vhost-user-1 is the name of the vhost-user port added
   to the switch.
   Repeat the above parameters for multiple devices, changing the
   chardev path and id as necessary. Note that a separate and different
   chardev path needs to be specified for each vhost-user device. For
   example, if you have a second vhost-user port named 'vhost-user-2',
   append your QEMU command line with an additional set of parameters:

   ```
   -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
   -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce
   -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2
   ```

2. Configure huge pages.
   QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports access
   a virtio-net device's virtual rings and packet buffers mapping the VM's
   physical memory on hugetlbfs. To enable vhost-user ports to map the VM's
   memory into their process address space, pass the following parameters
   to QEMU:

   ```
   -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,
   share=on
   -numa node,memdev=mem -mem-prealloc
   ```

3. Optional: Enable multiqueue support
   The vhost-user interface must be configured in Open vSwitch with the
   desired number of queues with:

   ```
   ovs-vsctl set Interface vhost-user-2 options:n_rxq=<requested queues>
   ```

   QEMU needs to be configured as well.
   The $q below should match the queues requested in OVS (if $q is more,
   packets will not be received).
   The $v is the number of vectors, which is '$q x 2 + 2'.

   ```
   -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
   -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q
   -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v
   ```

   If one wishes to use multiple queues for an interface in the guest, the
   driver in the guest operating system must be configured to do so. It is
   recommended that the number of queues configured be equal to '$q'.

   For example, this can be done for the Linux kernel virtio-net driver with:

   ```
   ethtool -L <DEV> combined <$q>
   ```

   A note on the command above:

   `-L`: Changes the numbers of channels of the specified network device

   `combined`: Changes the number of multi-purpose channels.

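   The currently configured channel counts in the guest can be checked with
   (an optional verification):

   ```
   ethtool -l <DEV>
   ```
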
DPDK vhost-cuse:
----------------

The following sections describe the use of vhost-cuse 'dpdkvhostcuse' ports
with OVS.

DPDK vhost-cuse Prerequisites:
------------------------------

1. DPDK 16.04 with vhost support enabled as documented in the "Building and
   Installing" section.
   As an additional step, you must enable vhost-cuse in DPDK by setting the
   following additional flag in `config/common_base`:

   `CONFIG_RTE_LIBRTE_VHOST_USER=n`

   Following this, rebuild DPDK as per the instructions in the "Building and
   Installing" section. Finally, rebuild OVS as per step 3 in the "Building
   and Installing" section - OVS will detect that DPDK has vhost-cuse libraries
   compiled and in turn will enable support for it in the switch and disable
   vhost-user support.

2. Insert the Cuse module:

   `modprobe cuse`

3. Build and insert the `eventfd_link` module:

   ```
   cd $DPDK_DIR/lib/librte_vhost/eventfd_link/
   make
   insmod $DPDK_DIR/lib/librte_vhost/eventfd_link.ko
   ```

4. QEMU version v2.1.0+

   vhost-cuse will work with QEMU v2.1.0 and above, however it is recommended to
   use v2.2.0 if providing your VM with memory greater than 1GB due to potential
   issues with memory mapping larger areas.
   Note: QEMU v1.6.2 will also work, with slightly different command line parameters,
   which are specified later in this document.

Adding DPDK vhost-cuse ports to the Switch:
-------------------------------------------

Following the steps above to create a bridge, you can now add DPDK vhost-cuse
as a port to the vswitch. Unlike DPDK ring ports, DPDK vhost-cuse ports can have
arbitrary names.

- For vhost-cuse, the name of the port type is `dpdkvhostcuse`

  ```
  ovs-vsctl add-port br0 vhost-cuse-1 -- set Interface vhost-cuse-1
  type=dpdkvhostcuse
  ```

  When attaching vhost-cuse ports to QEMU, the name provided during the
  add-port operation must match the ifname parameter on the QEMU command
  line. More instructions on this can be found in the next section.

DPDK vhost-cuse VM configuration:
---------------------------------

vhost-cuse ports use a Linux* character device to communicate with QEMU.
By default it is set to `/dev/vhost-net`. It is possible to reuse this
standard device for DPDK vhost, which makes setup a little simpler, but it
is better practice to specify an alternative character device in order to
avoid any conflicts if kernel vhost is to be used in parallel.

1. This step is only needed if using an alternative character device.

   The new character device filename must be specified in the ovsdb:

   `./utilities/ovs-vsctl --no-wait set Open_vSwitch . \
     other_config:cuse-dev-name=my-vhost-net`

   In the example above, the character device to be used will be
   `/dev/my-vhost-net`.

2. This step is only needed if reusing the standard character device. It will
   conflict with the kernel vhost character device so the user must first
   remove it.

   `rm -rf /dev/vhost-net`

3a. Configure virtio-net adaptors:
    The following parameters must be passed to the QEMU binary:

    ```
    -netdev tap,id=<id>,script=no,downscript=no,ifname=<name>,vhost=on
    -device virtio-net-pci,netdev=net1,mac=<mac>
    ```

    Repeat the above parameters for multiple devices.

    The DPDK vhost library will negotiate its own features, so they
    need not be passed in as command line params. Note that as offloads are
    disabled this is the equivalent of setting:

    `csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off`

3b. If using an alternative character device, it must also be explicitly
    passed to QEMU using the `vhostfd` argument:

    ```
    -netdev tap,id=<id>,script=no,downscript=no,ifname=<name>,vhost=on,
    vhostfd=<open_fd>
    -device virtio-net-pci,netdev=net1,mac=<mac>
    ```

    The open file descriptor must be passed to QEMU running as a child
    process. This could be done with a simple python script.

    ```
    #!/usr/bin/python
    import os
    import subprocess

    # Open the vhost character device; the fd stays open and is inherited
    # by the QEMU child process.
    fd = os.open("/dev/usvhost", os.O_RDWR)
    subprocess.call("qemu-system-x86_64 .... -netdev tap,id=vhostnet0,"
                    "vhost=on,vhostfd=" + str(fd) + " ....", shell=True)
    ```

    Alternatively the `qemu-wrap.py` script can be used to automate the
    requirements specified above and can be used in conjunction with libvirt if
    desired. See the "DPDK vhost VM configuration with QEMU wrapper" section
    below.

4. Configure huge pages:
   QEMU must allocate the VM's memory on hugetlbfs. Vhost ports access a
   virtio-net device's virtual rings and packet buffers mapping the VM's
   physical memory on hugetlbfs. To enable vhost-ports to map the VM's
   memory into their process address space, pass the following parameters
   to QEMU:

   `-object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,
   share=on -numa node,memdev=mem -mem-prealloc`

   Note: For use with an earlier QEMU version such as v1.6.2, use the
   following to configure hugepages instead:

   `-mem-path /dev/hugepages -mem-prealloc`

DPDK vhost-cuse VM configuration with QEMU wrapper:
---------------------------------------------------
The QEMU wrapper script automatically detects and calls QEMU with the
necessary parameters. It performs the following actions:

  * Automatically detects the location of the hugetlbfs and inserts this
    into the command line parameters.
  * Automatically opens file descriptors for each virtio-net device and
    inserts these into the command line parameters.
  * Calls QEMU passing both the command line parameters passed to the
    script itself and those it has auto-detected.

Before use, you **must** edit the configuration parameters section of the
script to point to the correct emulator location and set additional
settings. Of these settings, `emul_path` and `us_vhost_path` **must** be
set. All other settings are optional.

To use directly from the command line simply pass the wrapper some of the
QEMU parameters: it will configure the rest. For example:

```
qemu-wrap.py -cpu host -boot c -hda <disk image> -m 4096 -smp 4
--enable-kvm -nographic -vnc none -net none -netdev tap,id=net1,
script=no,downscript=no,ifname=if1,vhost=on -device virtio-net-pci,
netdev=net1,mac=00:00:00:00:00:01
```

DPDK vhost-cuse VM configuration with libvirt:
----------------------------------------------

If you are using libvirt, you must enable libvirt to access the character
device by adding it to controllers cgroup for libvirtd using the following
steps.

1. In `/etc/libvirt/qemu.conf` add/edit the following lines:

   ```
   1) clear_emulator_capabilities = 0
   2) user = "root"
   3) group = "root"
   4) cgroup_device_acl = [
          "/dev/null", "/dev/full", "/dev/zero",
          "/dev/random", "/dev/urandom",
          "/dev/ptmx", "/dev/kvm", "/dev/kqemu",
          "/dev/rtc", "/dev/hpet", "/dev/net/tun",
          "/dev/<my-vhost-device>",
          "/dev/hugepages"]
   ```

   <my-vhost-device> refers to "vhost-net" if using the `/dev/vhost-net`
   device. If you have specified a different name in the database
   using the "other_config:cuse-dev-name" parameter, please specify that
   filename instead.

2. Disable SELinux or set to permissive mode

3. Restart the libvirtd process
   For example, on Fedora:

   `systemctl restart libvirtd.service`

After successfully editing the configuration, you may launch your
vhost-enabled VM. The XML describing the VM can be configured like so
within the <qemu:commandline> section:

1. Set up shared hugepages:

   ```
   <qemu:arg value='-object'/>
   <qemu:arg value='memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on'/>
   <qemu:arg value='-numa'/>
   <qemu:arg value='node,memdev=mem'/>
   <qemu:arg value='-mem-prealloc'/>
   ```

2. Set up your tap devices:

   ```
   <qemu:arg value='-netdev'/>
   <qemu:arg value='type=tap,id=net1,script=no,downscript=no,ifname=vhost0,vhost=on'/>
   <qemu:arg value='-device'/>
   <qemu:arg value='virtio-net-pci,netdev=net1,mac=00:00:00:00:00:01'/>
   ```

   Repeat for as many devices as are desired, modifying the id, ifname
   and mac as necessary.

   Again, if you are using an alternative character device (other than
   `/dev/vhost-net`), please specify the file descriptor like so:

   `<qemu:arg value='type=tap,id=net3,script=no,downscript=no,ifname=vhost0,vhost=on,vhostfd=<open_fd>'/>`

   Where <open_fd> refers to the open file descriptor of the character device.
   Instructions on how to retrieve the file descriptor can be found in the
   "DPDK vhost VM configuration" section.
   Alternatively, the process is automated with the qemu-wrap.py script,
   detailed in the next section.

Now you may launch your VM using virt-manager, or like so:

`virsh create my_vhost_vm.xml`

DPDK vhost-cuse VM configuration with libvirt and QEMU wrapper:
---------------------------------------------------------------

To use the qemu-wrapper script in conjunction with libvirt, follow the
steps in the previous section before proceeding with the following steps:

1. Place `qemu-wrap.py` in libvirtd's binary search PATH ($PATH)
   Ideally in the same directory that the QEMU binary is located.

2. Ensure that the script has the same owner/group and file permissions
   as the QEMU binary.

3. Update the VM xml file using "virsh edit VM.xml"

   1. Set the VM to use the launch script.
      Set the emulator path contained in the `<emulator></emulator>` tags.
      For example, replace:

      `<emulator>/usr/bin/qemu-kvm</emulator>`

      with:

      `<emulator>/usr/bin/qemu-wrap.py</emulator>`

4. Edit the Configuration Parameters section of the script to point to
   the correct emulator location and set any additional options. If you are
   using an alternative character device name, please set "us_vhost_path" to
   the location of that device. The script will automatically detect and
   insert the correct "vhostfd" value in the QEMU command line arguments.

5. Use virt-manager to launch the VM

Running ovs-vswitchd with DPDK backend inside a VM
--------------------------------------------------

Please note that additional configuration is required if you want to run
ovs-vswitchd with DPDK backend inside a QEMU virtual machine. Ovs-vswitchd
creates separate DPDK TX queues for each CPU core available. This operation
fails inside a QEMU virtual machine because, by default, the VirtIO NIC
provided to the guest is configured to support only a single TX queue and a
single RX queue. To change this behavior, you need to turn on the 'mq'
(multiqueue) property of all virtio-net-pci devices emulated by QEMU and used
by DPDK. You may do it manually (by changing the QEMU command line) or, if
you use Libvirt, by adding the following string to the <interface> sections
of all network devices used by DPDK:

`<driver name='vhost' queues='N'/>`

Parameter 'N' determines how many queues can be used by the guest.

Restrictions:
-------------

  - Works with 1500 MTU; a few changes are needed in the DPDK lib to fix this
    issue.
  - Currently the DPDK port does not make use of any offload functionality.
  - DPDK-vHost support works with 1G huge pages.

  ivshmem:
  - If you run Open vSwitch with smaller page sizes (e.g. 2MB), you may be
    unable to share any rings or mempools with a virtual machine.
    This is because the current implementation of ivshmem works by sharing
    a single 1GB huge page from the host operating system to any guest
    operating system through the Qemu ivshmem device. When using smaller
    page sizes, multiple pages may be required to hold the ring descriptors
    and buffer pools. The Qemu ivshmem device does not allow you to share
    multiple file descriptors to the guest operating system. However, if you
    want to share dpdkr rings with other processes on the host, you can do
    this with smaller page sizes.

  Platform and Network Interface:
  - By default with DPDK 16.04, a maximum of 64 TX queues can be used with an
    Intel XL710 Network Interface on a platform with more than 64 logical
    cores. If a user attempts to add an XL710 interface as a DPDK port type to
    a system as described above, an error will be reported that initialization
    failed for the 65th queue. OVS will then roll back to the previous
    successful queue initialization and use that value as the total number of
    TX queues available with queue locking. If a user wishes to use more than
    64 queues and avoid locking, then the
    `CONFIG_RTE_LIBRTE_I40E_QUEUE_NUM_PER_PF` config parameter in DPDK must be
    increased to the desired number of queues. Both DPDK and OVS must be
    recompiled for this change to take effect.

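    A sketch of that config change (assuming DPDK is unpacked at $DPDK_DIR and
    that the parameter lives in config/common_base; adjust the value to the
    number of queues you need):

    ```
    sed -i 's/^CONFIG_RTE_LIBRTE_I40E_QUEUE_NUM_PER_PF=.*/CONFIG_RTE_LIBRTE_I40E_QUEUE_NUM_PER_PF=128/' \
        $DPDK_DIR/config/common_base
    ```
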
Bug Reporting:
--------------

Please report problems to bugs@openvswitch.org.

[INSTALL.userspace.md]:INSTALL.userspace.md
[INSTALL.md]:INSTALL.md
[DPDK Linux GSG]: http://www.dpdk.org/doc/guides/linux_gsg/build_dpdk.html#binding-and-unbinding-network-ports-to-from-the-igb-uioor-vfio-modules
[DPDK Docs]: http://dpdk.org/doc