OVS DPDK ADVANCED INSTALL GUIDE
===============================

## Contents

1. [Overview](#overview)
2. [Building Shared Library](#build)
3. [System Configuration](#sysconf)
4. [Performance Tuning](#perftune)
5. [OVS Testcases](#ovstc)
6. [Vhost Walkthrough](#vhost)
7. [QOS](#qos)
8. [Rate Limiting](#rl)
9. [Flow Control](#fc)
10. [Vsperf](#vsperf)

## <a name="overview"></a> 1. Overview

The Advanced Install Guide explains how to improve OVS performance when using
the DPDK datapath. It also provides information on tuning, system configuration,
troubleshooting, static code analysis and testcases.

## <a name="build"></a> 2. Building Shared Library

DPDK can be built as either a static or a shared library, to be linked by
applications using the DPDK datapath. This section lists the steps to build the
shared library and dynamically link DPDK against OVS.

Note: A minor performance loss is seen with OVS when using the shared DPDK
library, as compared to the static library.

See the [INSTALL DPDK] and [INSTALL OVS] sections of INSTALL.DPDK for download
instructions for DPDK and OVS.

* Configure the DPDK library

Set `CONFIG_RTE_BUILD_SHARED_LIB=y` in `config/common_base`
to generate the shared DPDK library.

* Build and install DPDK

For a default install (without IVSHMEM), set `export DPDK_TARGET=x86_64-native-linuxapp-gcc`
For the IVSHMEM case, set `export DPDK_TARGET=x86_64-ivshmem-linuxapp-gcc`

```
export DPDK_DIR=/usr/src/dpdk-16.07
export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
# make must be run from the DPDK source tree
cd $DPDK_DIR
make install T=$DPDK_TARGET DESTDIR=install
```

* Build, Install and Setup OVS

Export the DPDK shared library location and set up OVS as listed in
section 3.3 of INSTALL.DPDK.

`export LD_LIBRARY_PATH=$DPDK_DIR/x86_64-native-linuxapp-gcc/lib`

## <a name="sysconf"></a> 3. System Configuration

To achieve optimal OVS performance, the system should be configured
accordingly: this includes BIOS tweaks, GRUB cmdline additions, an
understanding of the NUMA topology, and appropriate selection of PCIe slots
for NIC placement.

### 3.1 Recommended BIOS settings

```
| Settings                   | Values      | Comments |
|----------------------------|-------------|----------|
| C3 power state             | Disabled    | -        |
| C6 power state             | Disabled    | -        |
| MLC Streamer               | Enabled     | -        |
| MLC Spatial prefetcher     | Enabled     | -        |
| DCU Data prefetcher        | Enabled     | -        |
| DCA                        | Enabled     | -        |
| CPU power and performance  | Performance | -        |
| Memory RAS and perf        |             |          |
|   config -> NUMA optimized | Enabled     | -        |
```

### 3.2 PCIe Slot Selection

Fastpath performance also depends on factors such as NIC placement, the
channel speed between the PCIe slot and the CPU, and the proximity of the PCIe
slot to the CPU cores running the DPDK application. Listed below are the steps
to identify the right PCIe slot.

- Retrieve host details using the command `dmidecode -t baseboard | grep "Product Name"`
- Download the technical specification for the product listed, e.g. S2600WT2.
- Check the Product Architecture Overview for the riser slot placement,
CPU sharing info and PCIe channel speeds.

Example: On the S2600WT, CPU1 and CPU2 share Riser Slot 1, with a channel
speed of 32GB/s between CPU1 and Riser Slot 1 and 16GB/s between CPU2 and
Riser Slot 1. Running the DPDK application on CPU1 cores with the NIC inserted
into the riser card slots will optimize OVS performance in this case.

- Check the Riser Card #1 - Root Port mapping information for the available
slots and individual bus speeds. On the S2600WT, slot 1 and slot 2 have high
bus speeds and are potential slots for NIC placement.

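As an additional sanity check, the link speed and width actually negotiated by
an installed NIC can be read with lspci (the PCI address below is an example
placeholder; substitute your NIC's address):

```
lspci -s 05:00.0 -vv | grep LnkSta
```
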
### 3.3 Advanced Hugepage setup

Allocate and mount 1G huge pages:

- For persistent allocation of huge pages, add the following options to the
kernel bootline

Add `default_hugepagesz=1GB hugepagesz=1G hugepages=N`

For platforms supporting multiple huge page sizes, add the options

`default_hugepagesz=<size> hugepagesz=<size> hugepages=N`
where 'N' = number of huge pages requested, 'size' = huge page size,
with an optional suffix [kKmMgG]

- For run-time allocation of huge pages

`echo N > /sys/devices/system/node/nodeX/hugepages/hugepages-1048576kB/nr_hugepages`
where 'N' = number of huge pages requested, 'X' = NUMA node

Note: For run-time allocation of 1G huge pages, the Contiguous Memory
Allocator (CONFIG_CMA) has to be supported by the kernel; check your Linux
distro.

- Mount huge pages

`mount -t hugetlbfs -o pagesize=1G none /dev/hugepages`

Note: Mount huge pages if not already mounted by default.

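To verify that the allocation succeeded (a minimal check; node numbers depend
on your topology):

```
# System-wide summary
grep Huge /proc/meminfo
# Per-node count of allocated 1G pages on node 0
cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
```
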
### 3.4 Enable Hyperthreading

Requires BIOS changes.

With HT/SMT enabled, a physical core appears as two logical cores.
SMT can be utilized to spawn worker threads on logical cores of the same
physical core, thereby saving additional cores.

With DPDK, when pinning pmd threads to logical cores, care must be taken
to set the correct bits in the pmd-cpu-mask to ensure that the pmd threads are
pinned to SMT siblings.

Example system configuration:
Dual socket machine, 2x 10-core processors, HT enabled, 40 logical cores

To use two logical cores which share the same physical core for pmd threads,
the following command can be used to identify a pair of logical cores.

`cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list`, where N is
the logical core number.

In this example, it would show that cores 1 and 21 share the same physical
core. The pmd-cpu-mask to enable two pmd threads running on these two logical
cores (one physical core) is 0x200002, i.e. bits 1 and 21 set:

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=200002`

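A quick way to double-check the mask arithmetic from a shell (nothing
OVS-specific):

```
# bits 1 and 21 set -> 0x200002
printf "%x\n" $(( (1 << 1) | (1 << 21) ))
```
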
### 3.5 Isolate cores

The 'isolcpus' option can be used to isolate cores from the Linux scheduler.
The isolated cores can then be dedicated to running HPC applications/threads.
This improves application performance due to zero context switching and
minimal cache thrashing. To run platform logic on core 0 and isolate cores
1 through 19 from the scheduler, add `isolcpus=1-19` to the GRUB cmdline.

Note: In some circumstances core isolation has been seen to provide only a
minimal advantage, owing to the maturity of the Linux scheduler.

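As a rough sketch of what the combined GRUB configuration might look like on a
Fedora-style system (file location and regeneration command vary by distro;
the hugepage count here is an example):

```
# /etc/default/grub (excerpt)
GRUB_CMDLINE_LINUX="isolcpus=1-19 default_hugepagesz=1G hugepagesz=1G hugepages=16"

# Regenerate the config and reboot for the change to take effect
grub2-mkconfig -o /boot/grub2/grub.cfg
```
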
### 3.6 NUMA/Cluster on Die

Ideally inter-NUMA datapaths should be avoided where possible, as packets
will go across QPI and there may be a slight performance penalty when
compared with intra-NUMA datapaths. On Intel Xeon Processor E5 v3,
Cluster On Die is introduced on models that have 10 cores or more.
This makes it possible to logically split a socket into two NUMA regions,
and again it is preferred where possible to keep critical datapaths
within one cluster.

It is good practice to ensure that threads that are in the datapath are
pinned to cores in the same NUMA area, e.g. pmd threads and the QEMU vCPUs
responsible for forwarding. If DPDK is built with
CONFIG_RTE_LIBRTE_VHOST_NUMA=y, vHost User ports automatically
detect the NUMA socket of the QEMU vCPUs and will be serviced by a PMD
from the same node, provided a core on this node is enabled in the
pmd-cpu-mask. The libnuma packages are required for this feature.

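When deciding where to pin, it helps to confirm which node a NIC actually sits
on. A minimal check (the PCI address is an example placeholder; -1 means the
platform reports no NUMA affinity):

```
cat /sys/bus/pci/devices/0000:05:00.0/numa_node
```
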
### 3.7 Compiler Optimizations

The default compiler optimization level is '-O2'. Changing this to a
more aggressive compiler optimization such as '-O3 -march=native'
with gcc (verified on 5.3.1) can produce performance gains, though not
significant ones. '-march=native' produces code optimized for the local
machine and should only be used when the software is compiled on the testbed
itself.

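For example, the flags can be passed when building OVS like so (a sketch;
adjust to your toolchain):

```
make CFLAGS='-O3 -march=native'
```
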
## <a name="perftune"></a> 4. Performance Tuning

### 4.1 Affinity

For superior performance, DPDK pmd threads and QEMU vCPU threads
need to be affinitized accordingly.

* PMD thread Affinity

A poll mode driver (pmd) thread handles the I/O of all DPDK
interfaces assigned to it. A pmd thread polls the ports
for incoming packets, switches the packets and sends them to a tx port.
A pmd thread is CPU-bound, and needs to be affinitized to isolated
cores for optimum performance.

By setting a bit in the mask, a pmd thread is created and pinned
to the corresponding CPU core. e.g. to run a pmd thread on core 2:

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=4`

Note: A pmd thread on a NUMA node is only created if there is
at least one DPDK interface from that NUMA node added to OVS.

* QEMU vCPU thread Affinity

A VM performing simple packet forwarding or running complex packet
pipelines has to ensure that the vCPU threads performing the work have
as much CPU occupancy as possible.

Example: On a multicore VM, multiple QEMU vCPU threads will be spawned.
When the DPDK 'testpmd' application that does the packet forwarding
is invoked, the 'taskset' command should be used to affinitize the vCPU
threads to the dedicated isolated cores on the host system.

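A minimal sketch of that pinning (the thread ID and core number are example
placeholders; identify the vCPU thread IDs on the host first):

```
# List QEMU threads to find the vCPU thread IDs
ps -eLo pid,tid,comm | grep qemu
# Pin one vCPU thread (example TID 12345) to isolated host core 4
taskset -pc 4 12345
```
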
### 4.2 Multiple poll mode driver threads

With pmd multi-threading support, OVS creates one pmd thread
for each NUMA node by default. However, in cases
where there are multiple ports/rxqs producing traffic, performance
can be improved by creating multiple pmd threads running on separate
cores. These pmd threads can then share the workload by each being
responsible for different ports/rxqs. Assignment of ports/rxqs to
pmd threads is done automatically.

A set bit in the mask means a pmd thread is created and pinned
to the corresponding CPU core. e.g. to run pmd threads on cores 1 and 2:

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6`

For example, when using dpdk and dpdkvhostuser ports in a bi-directional
VM loopback as shown below, spreading the workload over 2 or 4 pmd
threads shows significant improvements as there will be more total CPU
occupancy available.

NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1

### 4.3 DPDK physical port Rx Queues

`ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>`

The above command sets the number of rx queues for the specified DPDK physical
interface. The rx queues are assigned to pmd threads on the same NUMA node in
a round-robin fashion.

### 4.4 Exact Match Cache

Each pmd thread contains one EMC (Exact Match Cache). After initial flow setup
in the datapath, the EMC contains a single table and provides the lowest level
(fastest) switching for DPDK ports. If there is a miss in the EMC, then
the next level where switching will occur is the datapath classifier.
Missing in the EMC and looking up in the datapath classifier incurs a
significant performance penalty. If lookup misses occur in the EMC
because it is too small to handle the number of flows, its size can
be increased. The EMC size can be modified by editing the define
EM_FLOW_HASH_SHIFT in lib/dpif-netdev.c.

As mentioned above, an EMC is per pmd thread. So an alternative way of
increasing the aggregate number of possible flow entries in the EMC, and of
avoiding datapath classifier lookups, is to have multiple pmd threads
running. This can be done as described in section 4.2.

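One way to locate the define before editing it (the value and its surroundings
may differ between OVS versions; note the resulting EMC size is per pmd
thread):

```
# From the OVS source tree
grep -n "EM_FLOW_HASH_SHIFT" lib/dpif-netdev.c
```
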
### 4.5 Rx Mergeable buffers

Rx mergeable buffers is a virtio feature that allows chaining of multiple
virtio descriptors to handle large packet sizes. As such, large packets
are handled by reserving and chaining multiple free descriptors
together. Mergeable buffer support is negotiated between the virtio
driver and virtio device and is supported by the DPDK vhost library.
This behavior is typically supported and enabled by default, however
in the case where the user knows that rx mergeable buffers are not needed,
i.e. jumbo frames are not needed, it can be forced off by adding
mrg_rxbuf=off to the QEMU command line options. By not reserving multiple
chains of descriptors, more individual virtio descriptors are made
available for rx to the guest using dpdkvhost ports, and this can improve
performance.

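For example, appended to a virtio-net device definition from the QEMU command
lines shown earlier (the MAC and netdev id are placeholders):

```
-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=off
```
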
## <a name="ovstc"></a> 5. OVS Testcases
### 5.1 PHY-VM-PHY [VHOST LOOPBACK]

Section 5.2 of the INSTALL.DPDK guide lists the steps for the PVP loopback
testcase with packet forwarding using the DPDK testpmd application in the
guest VM. For users wanting to do packet forwarding using the kernel stack
instead, the steps below apply.

```
ifconfig eth1 1.1.1.2/24
ifconfig eth2 1.1.2.2/24
systemctl stop firewalld.service
systemctl stop iptables.service
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.eth1.rp_filter=0
sysctl -w net.ipv4.conf.eth2.rp_filter=0
route add -net 1.1.2.0/24 eth2
route add -net 1.1.1.0/24 eth1
arp -s 1.1.2.99 DE:AD:BE:EF:CA:FE
arp -s 1.1.1.99 DE:AD:BE:EF:CA:EE
```

### 5.2 PHY-VM-PHY [IVSHMEM]

Steps 1-5 in section 3.3 of the INSTALL.DPDK guide will create and initialize
the DB, start vswitchd and add dpdk devices to bridge br0.

1. Add a DPDK ring port to the bridge

```
ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr
```

2. Build the modified QEMU (qemu-2.2.1 + ivshmem-qemu-2.2.1.patch)

```
cd /usr/src/
wget http://wiki.qemu.org/download/qemu-2.2.1.tar.bz2
tar -jxvf qemu-2.2.1.tar.bz2
cd /usr/src/qemu-2.2.1
wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/patches/ivshmem-qemu-2.2.1.patch
patch -p1 < ivshmem-qemu-2.2.1.patch
./configure --target-list=x86_64-softmmu --enable-debug --extra-cflags='-g'
make -j 4
```

3. Generate the QEMU command line

```
mkdir -p /usr/src/cmdline_generator
cd /usr/src/cmdline_generator
wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/cmdline_generator.c
wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/Makefile
export RTE_SDK=/usr/src/dpdk-16.07
export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc
make
./build/cmdline_generator -m -p dpdkr0 XXX
cmdline=`cat OVSMEMPOOL`
```

4. Start the guest VM

```
export VM_NAME=ivshmem-vm
export QCOW2_IMAGE=/root/CentOS7_x86_64.qcow2
export QEMU_BIN=/usr/src/qemu-2.2.1/x86_64-softmmu/qemu-system-x86_64

taskset 0x20 $QEMU_BIN -cpu host -smp 2,cores=2 -hda $QCOW2_IMAGE -m 4096 --enable-kvm -name $VM_NAME -nographic -vnc :2 -pidfile /tmp/vm1.pid $cmdline
```

5. Run the sample "dpdk ring" app in the VM

```
echo 1024 > /proc/sys/vm/nr_hugepages
# Mount huge pages if not already mounted
mount -t hugetlbfs nodev /dev/hugepages

# Build the DPDK ring application in the VM
# (run from the directory containing the ring app's Makefile)
export RTE_SDK=/root/dpdk-16.07
export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc
make

# Run the dpdkring application; "-n 0" refers to ring '0', i.e. dpdkr0
./build/dpdkr -c 1 -n 4 -- -n 0
```

### 5.3 PHY-VM-PHY [VHOST MULTIQUEUE]

Steps 1-5 in section 3.3 of the [INSTALL DPDK] guide will create and
initialize the DB, start vswitchd and add dpdk devices to bridge br0.

1. Configure the PMDs and RXQs. For example, set the number of dpdk port rx
queues to at least 2. The number of rx queues at the vhost-user interface is
configured automatically after the virtio device connects and does not need
manual configuration.

```
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=c
ovs-vsctl set Interface dpdk0 options:n_rxq=2
ovs-vsctl set Interface dpdk1 options:n_rxq=2
```

2. Instantiate the guest VM using the QEMU cmdline

Guest configuration:

```
| Configuration        | Values   | Comments     |
|----------------------|----------|--------------|
| qemu version         | 2.5.0    | -            |
| qemu thread affinity | 2 cores  | taskset 0x30 |
| memory               | 4GB      | -            |
| cores                | 2        | -            |
| Qcow2 image          | Fedora22 | -            |
| multiqueue           | on       | -            |
```

Instantiate the guest:

```
export VM_NAME=vhost-vm
export GUEST_MEM=4096M
export QCOW2_IMAGE=/root/Fedora22_x86_64.qcow2
export VHOST_SOCK_DIR=/usr/local/var/run/openvswitch

taskset 0x30 qemu-system-x86_64 -cpu host -smp 2,cores=2 -drive file=$QCOW2_IMAGE -m 4096M --enable-kvm -name $VM_NAME -nographic -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on -numa node,memdev=mem -mem-prealloc -chardev socket,id=char1,path=$VHOST_SOCK_DIR/dpdkvhostuser0 -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=2 -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mq=on,vectors=6 -chardev socket,id=char2,path=$VHOST_SOCK_DIR/dpdkvhostuser1 -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=2 -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=6
```

Note: The queue value above should match the queues configured in OVS, and the
vector value should be set to 'no. of queues x 2 + 2'.

3. Guest interface configuration

Assuming there are 2 interfaces in the guest named eth0 and eth1, check the
channel configuration and set the number of combined channels to 2 for the
virtio devices. More information can be found in the [Vhost walkthrough]
section.

```
ethtool -l eth0
ethtool -L eth0 combined 2
ethtool -L eth1 combined 2
```

4. Kernel packet forwarding

Configure IP addresses and enable the interfaces:

```
ifconfig eth0 5.5.5.1/24 up
ifconfig eth1 90.90.90.1/24 up
```

Configure IP forwarding and add route entries:

```
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.eth0.rp_filter=0
sysctl -w net.ipv4.conf.eth1.rp_filter=0
ip route add 2.1.1.0/24 dev eth1
route add default gw 2.1.1.2 eth1
route add default gw 90.90.90.90 eth1
arp -s 90.90.90.90 DE:AD:BE:EF:CA:FE
arp -s 2.1.1.2 DE:AD:BE:EF:CA:FA
```

Check traffic on the multiple queues:

```
cat /proc/interrupts | grep virtio
```

## <a name="vhost"></a> 6. Vhost Walkthrough

DPDK 16.07 supports two types of vhost:

1. vhost-user - enabled by default

2. vhost-cuse - legacy, disabled by default

### 6.1 vhost-user

- Prerequisites:

QEMU version >= 2.2

- Adding vhost-user ports to the switch

Unlike DPDK ring ports, DPDK vhost-user ports can have arbitrary names,
except that forward and backward slashes are prohibited in the names.

For vhost-user, the name of the port type is `dpdkvhostuser`

```
ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1
type=dpdkvhostuser
```

This action creates a socket located at
`/usr/local/var/run/openvswitch/vhost-user-1`, which you must provide
to your VM on the QEMU command line. More instructions on this can be
found in the next section, "Adding vhost-user ports to the VM".

Note: If you wish for the vhost-user sockets to be created in a
sub-directory of `/usr/local/var/run/openvswitch`, you may specify
this directory in the ovsdb like so:

`./utilities/ovs-vsctl --no-wait \
set Open_vSwitch . other_config:vhost-sock-dir=subdir`

- Adding vhost-user ports to the VM

1. Configure sockets

Pass the following parameters to QEMU to attach a vhost-user device:

```
-chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1
-netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1
```

where vhost-user-1 is the name of the vhost-user port added
to the switch.
Repeat the above parameters for multiple devices, changing the
chardev path and id as necessary. Note that a separate and different
chardev path needs to be specified for each vhost-user device. For
example, if you have a second vhost-user port named 'vhost-user-2', you
append your QEMU command line with an additional set of parameters:

```
-chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
-netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce
-device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2
```

2. Configure huge pages

QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports access
a virtio-net device's virtual rings and packet buffers by mapping the VM's
physical memory on hugetlbfs. To enable vhost-user ports to map the VM's
memory into their process address space, pass the following parameters
to QEMU:

```
-object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,
share=on -numa node,memdev=mem -mem-prealloc
```

3. Enable multiqueue support (OPTIONAL)

QEMU needs to be configured to use multiqueue.
The $q below is the number of queues.
The $v is the number of vectors, which is '$q x 2 + 2'.

```
-chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
-netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q
-device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v
```

The vhost-user interface will be automatically reconfigured with the required
number of rx and tx queues after the virtio device connects.
Manual configuration of `n_rxq` is not supported because OVS will work
properly only if `n_rxq` matches the number of queues configured in QEMU.

At least 2 PMDs should be configured for the vswitch when using multiqueue.
Using a single PMD will cause traffic to be enqueued to the same vhost
queue rather than being distributed among different vhost queues for a
vhost-user interface.

If traffic destined for a VM configured with multiqueue arrives at the
vswitch via a physical DPDK port, then the number of rxqs should also be
set to at least 2 for that physical DPDK port. This is required to increase
the probability that a different PMD will handle the multiqueue
transmission to the guest using a different vhost queue.

If one wishes to use multiple queues for an interface in the guest, the
driver in the guest operating system must be configured to do so. It is
recommended that the number of queues configured be equal to '$q'.

For example, this can be done for the Linux kernel virtio-net driver with:

```
ethtool -L <DEV> combined <$q>
```

where `-L` changes the number of channels of the specified network device
and `combined` changes the number of multi-purpose channels.

- VM Configuration with libvirt

* Change the user/group, access control policy and restart libvirtd.

- In `/etc/libvirt/qemu.conf` add/edit the following lines

```
user = "root"
group = "root"
```

- Disable SELinux or set it to permissive mode

`setenforce 0`

- Restart the libvirtd process. For example, on Fedora:

`systemctl restart libvirtd.service`

* Instantiate the VM

- Copy the xml configuration from [Guest VM using libvirt] into your
workspace.

- Start the VM.

`virsh create demovm.xml`

- Connect to the guest console

`virsh console demovm`

* VM configuration

The demovm xml configuration is aimed at achieving out-of-the-box performance
for the VM.

- The vcpus are pinned to the cores of CPU socket 0 using vcpupin.

- The NUMA cell and shared memory are configured using memAccess='shared'.

- Mergeable buffers are disabled using mrg_rxbuf='off'.

Note: For information on libvirt and further tuning, refer to [libvirt].

### 6.2 vhost-cuse

- Prerequisites:

QEMU version >= 2.2

- Enable vhost-cuse support

1. Enable vhost-cuse support in DPDK

Set `CONFIG_RTE_LIBRTE_VHOST_USER=n` in config/common_linuxapp and follow the
steps in section 2.2 of the INSTALL.DPDK guide to build DPDK with cuse
support. OVS will detect that DPDK has the vhost-cuse libraries compiled and
in turn will enable support for it in the switch and disable vhost-user
support.

2. Insert the Cuse module

`modprobe cuse`

3. Build and insert the `eventfd_link` module

```
cd $DPDK_DIR/lib/librte_vhost/eventfd_link/
make
insmod $DPDK_DIR/lib/librte_vhost/eventfd_link.ko
```

- Adding vhost-cuse ports to the switch

Unlike DPDK ring ports, DPDK vhost-cuse ports can have arbitrary names.
For vhost-cuse, the name of the port type is `dpdkvhostcuse`

```
ovs-vsctl add-port br0 vhost-cuse-1 -- set Interface vhost-cuse-1
type=dpdkvhostcuse
```

When attaching vhost-cuse ports to QEMU, the name provided during the
add-port operation must match the ifname parameter on the QEMU cmd line.

- Adding vhost-cuse ports to the VM

vhost-cuse ports use a Linux* character device to communicate with QEMU.
By default it is set to `/dev/vhost-net`. It is possible to reuse this
standard device for DPDK vhost, which makes setup a little simpler, but it
is better practice to specify an alternative character device in order to
avoid any conflicts if kernel vhost is to be used in parallel.

1. This step is only needed if using an alternative character device.

```
./utilities/ovs-vsctl --no-wait set Open_vSwitch . \
other_config:cuse-dev-name=my-vhost-net
```

In the example above, the character device to be used will be
`/dev/my-vhost-net`.

2. If the kernel vhost character device is being reused, it will conflict
with the DPDK vhost device and the user should remove it first.

`rm -rf /dev/vhost-net`

3. Configure virtio-net adapters

The following parameters must be passed to the QEMU binary; repeat
the below parameters for multiple devices.

```
-netdev tap,id=<id>,script=no,downscript=no,ifname=<name>,vhost=on
-device virtio-net-pci,netdev=net1,mac=<mac>
```

The DPDK vhost library will negotiate its own features, so they
need not be passed in as command line params. Note that as offloads
are disabled this is the equivalent of setting

`csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off`

When using an alternative character device, it must be explicitly
passed to QEMU using the `vhostfd` argument

```
-netdev tap,id=<id>,script=no,downscript=no,ifname=<name>,vhost=on,
vhostfd=<open_fd> -device virtio-net-pci,netdev=net1,mac=<mac>
```

The open file descriptor must be passed to QEMU running as a child
process. This could be done with a simple python script, for example:

```
#!/usr/bin/python
import os
import subprocess

# Open the vhost character device and pass its fd to QEMU as a child process
fd = os.open("/dev/usvhost", os.O_RDWR)
subprocess.call("qemu-system-x86_64 .... -netdev tap,id=vhostnet0,"
                "vhost=on,vhostfd=" + str(fd) + " ....", shell=True)
```

4. Configure huge pages

QEMU must allocate the VM's memory on hugetlbfs. Vhost ports access a
virtio-net device's virtual rings and packet buffers by mapping the VM's
physical memory on hugetlbfs. To enable vhost ports to map the VM's
memory into their process address space, pass the following parameters
to QEMU

`-object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,
share=on -numa node,memdev=mem -mem-prealloc`

- VM Configuration with QEMU wrapper

The QEMU wrapper script automatically detects and calls QEMU with the
necessary parameters. It performs the following actions:

* Automatically detects the location of the hugetlbfs and inserts this
into the command line parameters.
* Automatically opens file descriptors for each virtio-net device and
inserts these into the command line parameters.
* Calls QEMU passing both the command line parameters passed to the
script itself and those it has auto-detected.

Before use, you **must** edit the configuration parameters section of the
script to point to the correct emulator location and set additional
settings. Of these settings, `emul_path` and `us_vhost_path` **must** be
set. All other settings are optional.

To use directly from the command line, simply pass the wrapper some of the
QEMU parameters: it will configure the rest. For example:

```
qemu-wrap.py -cpu host -boot c -hda <disk image> -m 4096 -smp 4
--enable-kvm -nographic -vnc none -net none -netdev tap,id=net1,
script=no,downscript=no,ifname=if1,vhost=on -device virtio-net-pci,
netdev=net1,mac=00:00:00:00:00:01
```

- VM Configuration with libvirt

If you are using libvirt, you must enable libvirt to access the character
device by adding it to the controllers cgroup for libvirtd using the
following steps.

1. In `/etc/libvirt/qemu.conf` add/edit the following lines:

```
clear_emulator_capabilities = 0
user = "root"
group = "root"
cgroup_device_acl = [
"/dev/null", "/dev/full", "/dev/zero",
"/dev/random", "/dev/urandom",
"/dev/ptmx", "/dev/kvm", "/dev/kqemu",
"/dev/rtc", "/dev/hpet", "/dev/net/tun",
"/dev/<my-vhost-device>",
"/dev/hugepages"]
```

<my-vhost-device> refers to "vhost-net" if using the `/dev/vhost-net`
device. If you have specified a different name in the database
using the "other_config:cuse-dev-name" parameter, please specify that
filename instead.

2. Disable SELinux or set it to permissive mode

3. Restart the libvirtd process.
For example, on Fedora:

`systemctl restart libvirtd.service`

After successfully editing the configuration, you may launch your
vhost-enabled VM. The XML describing the VM can be configured like so
within the <qemu:commandline> section:

1. Set up shared huge pages:

```
<qemu:arg value='-object'/>
<qemu:arg value='memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on'/>
<qemu:arg value='-numa'/>
<qemu:arg value='node,memdev=mem'/>
<qemu:arg value='-mem-prealloc'/>
```

2. Set up your tap devices:

```
<qemu:arg value='-netdev'/>
<qemu:arg value='type=tap,id=net1,script=no,downscript=no,ifname=vhost0,vhost=on'/>
<qemu:arg value='-device'/>
<qemu:arg value='virtio-net-pci,netdev=net1,mac=00:00:00:00:00:01'/>
```

Repeat for as many devices as are desired, modifying the id, ifname
and mac as necessary.

Again, if you are using an alternative character device (other than
`/dev/vhost-net`), please specify the file descriptor like so:

`<qemu:arg value='type=tap,id=net3,script=no,downscript=no,ifname=vhost0,vhost=on,vhostfd=<open_fd>'/>`

Where <open_fd> refers to the open file descriptor of the character device.
Instructions on how to retrieve the file descriptor can be found in the
"DPDK vhost VM configuration" section.
Alternatively, the process is automated with the qemu-wrap.py script,
detailed in the next section.

Now you may launch your VM using virt-manager, or like so:

`virsh create my_vhost_vm.xml`

- VM Configuration with libvirt & QEMU wrapper

To use the qemu-wrapper script in conjunction with libvirt, follow the
steps in the previous section before proceeding with the following steps:

1. Place `qemu-wrap.py` in libvirtd's binary search PATH ($PATH),
ideally in the same directory in which the QEMU binary is located.

2. Ensure that the script has the same owner/group and file permissions
as the QEMU binary.

3. Update the VM xml file using "virsh edit VM.xml"

Set the VM to use the launch script.
Set the emulator path contained in the `<emulator></emulator>` tags.
For example, replace `<emulator>/usr/bin/qemu-kvm</emulator>` with
`<emulator>/usr/bin/qemu-wrap.py</emulator>`

4. Edit the Configuration Parameters section of the script to point to
the correct emulator location and set any additional options. If you are
using an alternative character device name, please set "us_vhost_path" to the
location of that device. The script will automatically detect and insert
the correct "vhostfd" value in the QEMU command line arguments.

5. Use virt-manager to launch the VM

### 6.3 DPDK backend inside VM

Please note that additional configuration is required if you want to run
ovs-vswitchd with the DPDK backend inside a QEMU virtual machine. ovs-vswitchd
creates separate DPDK TX queues for each CPU core available. This operation
fails inside a QEMU virtual machine because, by default, the VirtIO NIC
provided to the guest is configured to support only a single TX queue and a
single RX queue. To change this behavior, you need to turn on the 'mq'
(multiqueue) property of all virtio-net-pci devices emulated by QEMU and used
by DPDK. You may do it manually (by changing the QEMU command line) or, if you
use libvirt, by adding the following string:

`<driver name='vhost' queues='N'/>`

to the <interface> sections of all network devices used by DPDK. Parameter 'N'
determines how many queues can be used by the guest. This may not work with
old versions of QEMU found in some distros and requires QEMU version >= 2.2.

## <a name="qos"></a> 7. QOS

Here is an example of QOS usage.
Assuming you have a vhost-user port transmitting traffic consisting of
packets of size 64 bytes, the following command would limit the egress
transmission rate of the port to ~1,000,000 packets per second:

`ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos create qos
type=egress-policer other-config:cir=46000000 other-config:cbs=2048`

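The cir value can be sanity-checked as follows, assuming the policer accounts
for bytes at the IP level, i.e. excluding each 64-byte frame's 14-byte
Ethernet header and 4-byte CRC (an interpretation inferred from the numbers,
not stated above):

```
# (64 - 14 - 4) bytes/pkt x 1,000,000 pkt/s
echo $(( (64 - 14 - 4) * 1000000 ))   # -> 46000000, the cir in bytes/s
```
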
To examine the QoS configuration of the port:

`ovs-appctl -t ovs-vswitchd qos/show vhost-user0`

To clear the QoS configuration from the port and ovsdb, use the following:

`ovs-vsctl destroy QoS vhost-user0 -- clear Port vhost-user0 qos`

For more details regarding egress-policer parameters, please refer to
vswitch.xml.

## <a name="rl"></a> 8. Rate Limiting

Here is an example of ingress policing usage.
Assuming you have a vhost-user port receiving traffic consisting of
packets of size 64 bytes, the following command would limit the reception
rate of the port to ~1,000,000 packets per second:

`ovs-vsctl set interface vhost-user0 ingress_policing_rate=368000
ingress_policing_burst=1000`

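The rate figure can be sanity-checked similarly, assuming
ingress_policing_rate is expressed in kbps and counts the same 46 bytes per
64-byte frame as in the QOS example above:

```
# 46 bytes/pkt x 8 bits x 1,000,000 pkt/s, converted to kbps
echo $(( 46 * 8 * 1000000 / 1000 ))   # -> 368000
```
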
To examine the ingress policer configuration of the port:

`ovs-vsctl list interface vhost-user0`

To clear the ingress policer configuration from the port, use the following:

`ovs-vsctl set interface vhost-user0 ingress_policing_rate=0`

For more details regarding the ingress policer, see vswitch.xml.

## <a name="fc"></a> 9. Flow Control

Flow control can be enabled only on DPDK physical ports.
To enable flow control support on the tx side while adding a port, add the
'tx-flow-ctrl' option to 'ovs-vsctl add-port' as in the example below.

```
ovs-vsctl add-port br0 dpdk0 -- \
set Interface dpdk0 type=dpdk options:tx-flow-ctrl=true
```

Similarly, to enable rx flow control:

```
ovs-vsctl add-port br0 dpdk0 -- \
set Interface dpdk0 type=dpdk options:rx-flow-ctrl=true
```

And to enable flow control auto-negotiation:

```
ovs-vsctl add-port br0 dpdk0 -- \
set Interface dpdk0 type=dpdk options:flow-ctrl-autoneg=true
```

To turn on tx flow control at run time (after the port has been added to
OVS), the command-line input will be:

`ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=true`

The flow control parameters can be turned off by setting the respective
parameter to 'false'. To disable flow control on the tx side:

`ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false`

## <a name="vsperf"></a> 10. Vsperf

The goal of the Vsperf project is to develop a vSwitch test framework that
can be used to validate the suitability of different vSwitch implementations
in a Telco deployment environment. More information can be found at the link
below:

https://wiki.opnfv.org/display/vsperf/VSperf+Home


Bug Reporting:
--------------

Please report problems to bugs@openvswitch.org.

[INSTALL.userspace.md]: INSTALL.userspace.md
[INSTALL.md]: INSTALL.md
[DPDK Linux GSG]: http://www.dpdk.org/doc/guides/linux_gsg/build_dpdk.html#binding-and-unbinding-network-ports-to-from-the-igb-uioor-vfio-modules
[DPDK Docs]: http://dpdk.org/doc
[libvirt]: http://libvirt.org/formatdomain.html
[Guest VM using libvirt]: INSTALL.DPDK.md#ovstc
[Vhost walkthrough]: INSTALL.DPDK.md#vhost
[INSTALL DPDK]: INSTALL.DPDK.md#build
[INSTALL OVS]: INSTALL.DPDK.md#build