OVS DPDK ADVANCED INSTALL GUIDE
===============================

## Contents

1. [Overview](#overview)
2. [Building Shared Library](#build)
3. [System configuration](#sysconf)
4. [Performance Tuning](#perftune)
5. [OVS Testcases](#ovstc)
6. [Vhost Walkthrough](#vhost)
7. [QOS](#qos)
8. [Rate Limiting](#rl)
9. [Flow Control](#fc)
10. [Pdump](#pdump)
11. [Jumbo Frames](#jumbo)
12. [Vsperf](#vsperf)

## <a name="overview"></a> 1. Overview

The Advanced Install Guide explains how to improve OVS performance when using
the DPDK datapath. This guide also provides information on tuning, system
configuration, troubleshooting, static code analysis and testcases.

## <a name="build"></a> 2. Building Shared Library

DPDK can be built as a static or a shared library and linked by applications
using the DPDK datapath. This section lists the steps to build DPDK as a
shared library and dynamically link it against OVS.

Note: A minor performance loss is seen with OVS when using a shared DPDK
library as compared to a static library.

Check the sections [INSTALL DPDK] and [INSTALL OVS] of INSTALL.DPDK for
download instructions for DPDK and OVS.

* Configure the DPDK library

  Set `CONFIG_RTE_BUILD_SHARED_LIB=y` in `config/common_base`
  to generate a shared DPDK library.

* Build and install DPDK

  For a default install (without IVSHMEM), set `export DPDK_TARGET=x86_64-native-linuxapp-gcc`.
  For the IVSHMEM case, set `export DPDK_TARGET=x86_64-ivshmem-linuxapp-gcc`.

  ```
  export DPDK_DIR=/usr/src/dpdk-16.07
  export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
  make install T=$DPDK_TARGET DESTDIR=install
  ```

* Build, install and set up OVS

  Export the DPDK shared library location and set up OVS as listed in
  section 3.3 of INSTALL.DPDK.

  `export LD_LIBRARY_PATH=$DPDK_DIR/x86_64-native-linuxapp-gcc/lib`

## <a name="sysconf"></a> 3. System Configuration

To achieve optimal OVS performance, the system can be configured accordingly;
this includes BIOS tweaks, GRUB cmdline additions, a good understanding of
NUMA nodes and apt selection of PCIe slots for NIC placement.

### 3.1 Recommended BIOS settings

```
| Settings                  | Values      | Comments |
|---------------------------|-------------|----------|
| C3 power state            | Disabled    | -        |
| C6 power state            | Disabled    | -        |
| MLC Streamer              | Enabled     | -        |
| MLC Spatial prefetcher    | Enabled     | -        |
| DCU Data prefetcher       | Enabled     | -        |
| DCA                       | Enabled     | -        |
| CPU power and performance | Performance | -        |
| Memory RAS and perf       |             | -        |
|  config -> NUMA optimized | Enabled     | -        |
```

### 3.2 PCIe Slot Selection

The fastpath performance also depends on factors like NIC placement, channel
speeds between the PCIe slot and CPU, and the proximity of the PCIe slot to
the CPU cores running the DPDK application. Listed below are the steps to
identify the right PCIe slot.

- Retrieve host details using the cmd `dmidecode -t baseboard | grep "Product Name"`
- Download the technical specification for the product listed, e.g. S2600WT2.
- Check the Product Architecture Overview for the riser slot placement,
  CPU sharing info and PCIe channel speeds.

  Example: On S2600WT, CPU1 and CPU2 share Riser Slot 1, with channel speeds of
  32GB/s between CPU1 and Riser Slot 1 and 16GB/s between CPU2 and Riser Slot 1.
  Running the DPDK app on CPU1 cores with the NIC inserted into the riser card
  slots will optimize OVS performance in this case.

- Check the Riser Card #1 - Root Port mapping information for the available
  slots and individual bus speeds. On S2600WT, slot 1 and slot 2 have high bus
  speeds and are potential slots for NIC placement.

### 3.3 Advanced Hugepage setup

Allocate and mount 1G huge pages:

- For persistent allocation of huge pages, add the following options to the
  kernel bootline:

  `default_hugepagesz=1GB hugepagesz=1G hugepages=N`

  For platforms supporting multiple huge page sizes, add the options:

  `default_hugepagesz=<size> hugepagesz=<size> hugepages=N`

  where 'N' = number of huge pages requested and 'size' = huge page size with
  an optional suffix [kKmMgG].

- For run-time allocation of huge pages:

  `echo N > /sys/devices/system/node/nodeX/hugepages/hugepages-1048576kB/nr_hugepages`

  where 'N' = number of huge pages requested and 'X' = NUMA node.

  Note: For run-time allocation of 1G huge pages, the Contiguous Memory
  Allocator (CONFIG_CMA) has to be supported by the kernel; check your Linux
  distro.

- Mount huge pages:

  `mount -t hugetlbfs -o pagesize=1G none /dev/hugepages`

  Note: Mount huge pages if not already mounted by default.

### 3.4 Enable Hyperthreading

Requires BIOS changes.

With HT/SMT enabled, a physical core appears as two logical cores.
SMT can be utilized to spawn worker threads on logical cores of the same
physical core, thereby saving additional cores.

With DPDK, when pinning pmd threads to logical cores, care must be taken
to set the correct bits in the pmd-cpu-mask to ensure that the pmd threads
are pinned to SMT siblings.

Example system configuration:
Dual socket machine, 2x 10 core processors, HT enabled, 40 logical cores.

To use two logical cores which share the same physical core for pmd threads,
the following command can be used to identify a pair of logical cores:

`cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list`, where N is
the logical core number.

In this example, it would show that cores 1 and 21 share the same physical
core. The pmd-cpu-mask to enable two pmd threads running on these two logical
cores (one physical core) is:

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=200002`

### 3.5 Isolate cores

The 'isolcpus' option can be used to isolate cores from the Linux scheduler.
The isolated cores can then be dedicated to running HPC applications/threads.
This helps achieve better application performance due to zero context
switching and minimal cache thrashing. To run platform logic on core 0 and
isolate cores 1 to 19 from the scheduler, add `isolcpus=1-19` to the GRUB
cmdline.

Note: It has been verified that core isolation has minimal advantage in some
circumstances, due to the mature Linux scheduler.

### 3.6 NUMA/Cluster on Die

Ideally, inter-NUMA datapaths should be avoided where possible, as packets
will go across QPI and there may be a slight performance penalty when
compared with intra-NUMA datapaths. On Intel Xeon Processor E5 v3,
Cluster On Die is introduced on models that have 10 cores or more.
This makes it possible to logically split a socket into two NUMA regions,
and again it is preferred where possible to keep critical datapaths
within one cluster.

It is good practice to ensure that threads that are in the datapath are
pinned to cores in the same NUMA area, e.g. pmd threads and QEMU vCPUs
responsible for forwarding. If DPDK is built with
CONFIG_RTE_LIBRTE_VHOST_NUMA=y, vHost User ports automatically
detect the NUMA socket of the QEMU vCPUs and will be serviced by a PMD
from the same node, provided a core on this node is enabled in the
pmd-cpu-mask. The libnuma packages are required for this feature.

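The NUMA node of a NIC can be checked from sysfs before pinning pmd threads; a
sketch, where the PCI address 0000:05:00.0 is a hypothetical example:

```shell
# NUMA node of a PCI NIC; -1 means the platform reported no locality info.
dev=0000:05:00.0
cat "/sys/bus/pci/devices/$dev/numa_node" 2>/dev/null || echo "device not present"

# NUMA node that a given logical CPU belongs to, e.g. core 1.
ls -d /sys/devices/system/cpu/cpu1/node* 2>/dev/null || echo "no NUMA info"
```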
### 3.7 Compiler Optimizations

The default compiler optimization level is '-O2'. Changing this to a more
aggressive compiler optimization such as '-O3 -march=native' with gcc
(verified on 5.3.1) can produce performance gains, though not significant.
'-march=native' will produce code optimized for the local machine and should
be used when the software is compiled on the testbed.

## <a name="perftune"></a> 4. Performance Tuning

### 4.1 Affinity

For superior performance, DPDK pmd threads and QEMU vCPU threads
need to be affinitized accordingly.

* PMD thread Affinity

  A poll mode driver (pmd) thread handles the I/O of all DPDK
  interfaces assigned to it. A pmd thread polls the ports
  for incoming packets, switches the packets and sends them to the tx port.
  A pmd thread is CPU bound, and needs to be affinitized to isolated
  cores for optimum performance.

  By setting a bit in the mask, a pmd thread is created and pinned
  to the corresponding CPU core. e.g. to run a pmd thread on core 2:

  `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=4`

  Note: A pmd thread on a NUMA node is only created if there is
  at least one DPDK interface from that NUMA node added to OVS.

* QEMU vCPU thread Affinity

  A VM performing simple packet forwarding or running complex packet
  pipelines has to ensure that the vCPU threads performing the work have
  as much CPU occupancy as possible.

  Example: On a multicore VM, multiple QEMU vCPU threads will be spawned.
  When the DPDK 'testpmd' application that does packet forwarding
  is invoked, the 'taskset' cmd should be used to affinitize the vCPU
  threads to the dedicated isolated cores on the host system.

### 4.2 Multiple poll mode driver threads

With pmd multi-threading support, OVS creates one pmd thread
for each NUMA node by default. However, in cases where there are
multiple ports/rxqs producing traffic, performance can be improved
by creating multiple pmd threads running on separate cores. These
pmd threads can then share the workload by each being responsible
for different ports/rxqs. Assignment of ports/rxqs to pmd threads
is done automatically.

A set bit in the mask means a pmd thread is created and pinned
to the corresponding CPU core. e.g. to run pmd threads on cores 1 and 2:

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6`

For example, when using dpdk and dpdkvhostuser ports in a bi-directional
VM loopback as shown below, spreading the workload over 2 or 4 pmd
threads shows significant improvements, as there will be more total CPU
occupancy available.

    NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1

### 4.3 DPDK physical port Rx Queues

`ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>`

The command above sets the number of rx queues for a DPDK physical interface.
The rx queues are assigned to pmd threads on the same NUMA node in a
round-robin fashion.

### 4.4 Exact Match Cache

Each pmd thread contains one EMC. After initial flow setup in the
datapath, the EMC contains a single table and provides the lowest level
(fastest) switching for DPDK ports. If there is a miss in the EMC, then
the next level where switching will occur is the datapath classifier.
Missing in the EMC and looking up in the datapath classifier incurs a
significant performance penalty. If lookup misses occur in the EMC
because it is too small to handle the number of flows, its size can
be increased. The EMC size can be modified by editing the define
EM_FLOW_HASH_SHIFT in lib/dpif-netdev.c.

As mentioned above, an EMC is per pmd thread. So an alternative way of
increasing the aggregate number of possible flow entries in the EMC, and
avoiding datapath classifier lookups, is to have multiple pmd threads
running. This can be done as described in section 4.2.

### 4.5 Rx Mergeable buffers

Rx mergeable buffers is a virtio feature that allows chaining of multiple
virtio descriptors to handle large packet sizes. As such, large packets
are handled by reserving and chaining multiple free descriptors
together. Mergeable buffer support is negotiated between the virtio
driver and virtio device and is supported by the DPDK vhost library.
This behavior is typically supported and enabled by default; however,
in the case where the user knows that rx mergeable buffers are not needed,
i.e. jumbo frames are not needed, it can be forced off by adding
mrg_rxbuf=off to the QEMU command line options. By not reserving multiple
chains of descriptors, more individual virtio descriptors are made
available for rx to the guest using dpdkvhost ports, and this can improve
performance.

## <a name="ovstc"></a> 5. OVS Testcases
### 5.1 PHY-VM-PHY [VHOST LOOPBACK]

Section 5.2 of the INSTALL.DPDK guide lists the steps for the PVP loopback
testcase and packet forwarding using the DPDK testpmd application in the
guest VM. For users wanting to do packet forwarding using the kernel stack,
the steps are below.

```
ifconfig eth1 1.1.1.2/24
ifconfig eth2 1.1.2.2/24
systemctl stop firewalld.service
systemctl stop iptables.service
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.eth1.rp_filter=0
sysctl -w net.ipv4.conf.eth2.rp_filter=0
route add -net 1.1.2.0/24 eth2
route add -net 1.1.1.0/24 eth1
arp -s 1.1.2.99 DE:AD:BE:EF:CA:FE
arp -s 1.1.1.99 DE:AD:BE:EF:CA:EE
```
313 | ||
314 | ### 5.2 PHY-VM-PHY [IVSHMEM] | |
315 | ||
316 | The steps (1-5) in 3.3 section of INSTALL.DPDK guide will create & initialize DB, | |
317 | start vswitchd and add dpdk devices to bridge br0. | |
318 | ||
319 | 1. Add DPDK ring port to the bridge | |
320 | ||
321 | ``` | |
322 | ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr | |
323 | ``` | |
324 | ||
325 | 2. Build modified Qemu (Qemu-2.2.1 + ivshmem-qemu-2.2.1.patch) | |
326 | ||
327 | ``` | |
328 | cd /usr/src/ | |
329 | wget http://wiki.qemu.org/download/qemu-2.2.1.tar.bz2 | |
330 | tar -jxvf qemu-2.2.1.tar.bz2 | |
331 | cd /usr/src/qemu-2.2.1 | |
332 | wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/patches/ivshmem-qemu-2.2.1.patch | |
333 | patch -p1 < ivshmem-qemu-2.2.1.patch | |
334 | ./configure --target-list=x86_64-softmmu --enable-debug --extra-cflags='-g' | |
335 | make -j 4 | |
336 | ``` | |
337 | ||
338 | 3. Generate Qemu commandline | |
339 | ||
340 | ``` | |
341 | mkdir -p /usr/src/cmdline_generator | |
342 | cd /usr/src/cmdline_generator | |
343 | wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/cmdline_generator.c | |
344 | wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/Makefile | |
0a0f39df | 345 | export RTE_SDK=/usr/src/dpdk-16.07 |
c9b9d6df BB |
346 | export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc |
347 | make | |
348 | ./build/cmdline_generator -m -p dpdkr0 XXX | |
349 | cmdline=`cat OVSMEMPOOL` | |
350 | ``` | |
351 | ||
352 | 4. start Guest VM | |
353 | ||
354 | ``` | |
355 | export VM_NAME=ivshmem-vm | |
356 | export QCOW2_IMAGE=/root/CentOS7_x86_64.qcow2 | |
357 | export QEMU_BIN=/usr/src/qemu-2.2.1/x86_64-softmmu/qemu-system-x86_64 | |
358 | ||
359 | taskset 0x20 $QEMU_BIN -cpu host -smp 2,cores=2 -hda $QCOW2_IMAGE -m 4096 --enable-kvm -name $VM_NAME -nographic -vnc :2 -pidfile /tmp/vm1.pid $cmdline | |
360 | ``` | |
361 | ||
362 | 5. Running sample "dpdk ring" app in VM | |
363 | ||
364 | ``` | |
365 | echo 1024 > /proc/sys/vm/nr_hugepages | |
366 | mount -t hugetlbfs nodev /dev/hugepages (if not already mounted) | |
367 | ||
368 | # Build the DPDK ring application in the VM | |
0a0f39df | 369 | export RTE_SDK=/root/dpdk-16.07 |
c9b9d6df BB |
370 | export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc |
371 | make | |
372 | ||
373 | # Run dpdkring application | |
374 | ./build/dpdkr -c 1 -n 4 -- -n 0 | |
375 | where "-n 0" refers to ring '0' i.e dpdkr0 | |
376 | ``` | |
377 | ||
### 5.3 PHY-VM-PHY [VHOST MULTIQUEUE]

The steps (1-5) in section 3.3 of the [INSTALL DPDK] guide will create and
initialize the DB, start vswitchd and add dpdk devices to bridge br0.

1. Configure the PMDs and RXQs. For example, set the no. of dpdk port rx
   queues to at least 2. The number of rx queues at the vhost-user interface
   gets automatically configured after virtio device connection and doesn't
   need manual configuration.

   ```
   ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=c
   ovs-vsctl set Interface dpdk0 options:n_rxq=2
   ovs-vsctl set Interface dpdk1 options:n_rxq=2
   ```

2. Instantiate the guest VM using the QEMU cmdline

   Guest configuration

   ```
   | Configuration        | Values   | Comments     |
   |----------------------|----------|--------------|
   | qemu version         | 2.5.0    | -            |
   | qemu thread affinity | 2 cores  | taskset 0x30 |
   | memory               | 4GB      | -            |
   | cores                | 2        | -            |
   | Qcow2 image          | Fedora22 | -            |
   | multiqueue           | on       | -            |
   ```

   Instantiate the guest

   ```
   export VM_NAME=vhost-vm
   export GUEST_MEM=4096M
   export QCOW2_IMAGE=/root/Fedora22_x86_64.qcow2
   export VHOST_SOCK_DIR=/usr/local/var/run/openvswitch

   taskset 0x30 qemu-system-x86_64 -cpu host -smp 2,cores=2 -drive file=$QCOW2_IMAGE -m 4096M --enable-kvm -name $VM_NAME -nographic -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on -numa node,memdev=mem -mem-prealloc -chardev socket,id=char1,path=$VHOST_SOCK_DIR/dpdkvhostuser0 -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=2 -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mq=on,vectors=6 -chardev socket,id=char2,path=$VHOST_SOCK_DIR/dpdkvhostuser1 -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=2 -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=6
   ```

   Note: The queue value above should match the queues configured in OVS, and
   the vector value should be set to 'no. of queues x 2 + 2'.

3. Guest interface configuration

   Assuming there are 2 interfaces in the guest named eth0 and eth1, check
   the channel configuration and set the number of combined channels to 2 for
   the virtio devices. More information can be found in the
   [Vhost walkthrough] section.

   ```
   ethtool -l eth0
   ethtool -L eth0 combined 2
   ethtool -L eth1 combined 2
   ```

4. Kernel packet forwarding

   Configure IP addresses and enable the interfaces

   ```
   ifconfig eth0 5.5.5.1/24 up
   ifconfig eth1 90.90.90.1/24 up
   ```

   Configure IP forwarding and add route entries

   ```
   sysctl -w net.ipv4.ip_forward=1
   sysctl -w net.ipv4.conf.all.rp_filter=0
   sysctl -w net.ipv4.conf.eth0.rp_filter=0
   sysctl -w net.ipv4.conf.eth1.rp_filter=0
   ip route add 2.1.1.0/24 dev eth1
   route add default gw 2.1.1.2 eth1
   route add default gw 90.90.90.90 eth1
   arp -s 90.90.90.90 DE:AD:BE:EF:CA:FE
   arp -s 2.1.1.2 DE:AD:BE:EF:CA:FA
   ```

   Check traffic on multiple queues

   ```
   cat /proc/interrupts | grep virtio
   ```

462 | ||
c9b9d6df | 463 | ## <a name="vhost"></a> 6. Vhost Walkthrough |
c9b9d6df BB |
464 | ### 6.1 vhost-user |
465 | ||
466 | - Prerequisites: | |
467 | ||
468 | QEMU version >= 2.2 | |
469 | ||
470 | - Adding vhost-user ports to Switch | |
471 | ||
472 | Unlike DPDK ring ports, DPDK vhost-user ports can have arbitrary names, | |
473 | except that forward and backward slashes are prohibited in the names. | |
474 | ||
475 | For vhost-user, the name of the port type is `dpdkvhostuser` | |
476 | ||
477 | ``` | |
478 | ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1 | |
479 | type=dpdkvhostuser | |
480 | ``` | |
481 | ||
482 | This action creates a socket located at | |
483 | `/usr/local/var/run/openvswitch/vhost-user-1`, which you must provide | |
484 | to your VM on the QEMU command line. More instructions on this can be | |
485 | found in the next section "Adding vhost-user ports to VM" | |
486 | ||
487 | Note: If you wish for the vhost-user sockets to be created in a | |
488 | sub-directory of `/usr/local/var/run/openvswitch`, you may specify | |
489 | this directory in the ovsdb like so: | |
490 | ||
491 | `./utilities/ovs-vsctl --no-wait \ | |
492 | set Open_vSwitch . other_config:vhost-sock-dir=subdir` | |
493 | ||
- Adding vhost-user ports to the VM

  1. Configure sockets

     Pass the following parameters to QEMU to attach a vhost-user device:

     ```
     -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1
     -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
     -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1
     ```

     where vhost-user-1 is the name of the vhost-user port added to the
     switch. Repeat the above parameters for multiple devices, changing the
     chardev path and id as necessary. Note that a separate and different
     chardev path needs to be specified for each vhost-user device. For
     example, if you have a second vhost-user port named 'vhost-user-2', you
     append your QEMU command line with an additional set of parameters:

     ```
     -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
     -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce
     -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2
     ```

519 | ||
520 | 2. Configure huge pages. | |
521 | ||
522 | QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports access | |
523 | a virtio-net device's virtual rings and packet buffers mapping the VM's | |
524 | physical memory on hugetlbfs. To enable vhost-user ports to map the VM's | |
525 | memory into their process address space, pass the following parameters | |
526 | to QEMU: | |
527 | ||
528 | ``` | |
529 | -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages, | |
530 | share=on -numa node,memdev=mem -mem-prealloc | |
531 | ``` | |
532 | ||
533 | 3. Enable multiqueue support(OPTIONAL) | |
534 | ||
81acebda IM |
535 | QEMU needs to be configured to use multiqueue. |
536 | The $q below is the number of queues. | |
c9b9d6df BB |
537 | The $v is the number of vectors, which is '$q x 2 + 2'. |
538 | ||
539 | ``` | |
540 | -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2 | |
541 | -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q | |
542 | -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v | |
543 | ``` | |
544 | ||
81acebda IM |
545 | The vhost-user interface will be automatically reconfigured with required |
546 | number of rx and tx queues after connection of virtio device. | |
547 | Manual configuration of `n_rxq` is not supported because OVS will work | |
548 | properly only if `n_rxq` will match number of queues configured in QEMU. | |
549 | ||
c9b9d6df BB |
550 | A least 2 PMDs should be configured for the vswitch when using multiqueue. |
551 | Using a single PMD will cause traffic to be enqueued to the same vhost | |
552 | queue rather than being distributed among different vhost queues for a | |
553 | vhost-user interface. | |
554 | ||
555 | If traffic destined for a VM configured with multiqueue arrives to the | |
556 | vswitch via a physical DPDK port, then the number of rxqs should also be | |
557 | set to at least 2 for that physical DPDK port. This is required to increase | |
558 | the probability that a different PMD will handle the multiqueue | |
559 | transmission to the guest using a different vhost queue. | |
560 | ||
561 | If one wishes to use multiple queues for an interface in the guest, the | |
562 | driver in the guest operating system must be configured to do so. It is | |
563 | recommended that the number of queues configured be equal to '$q'. | |
564 | ||
565 | For example, this can be done for the Linux kernel virtio-net driver with: | |
566 | ||
567 | ``` | |
568 | ethtool -L <DEV> combined <$q> | |
569 | ``` | |
570 | where `-L`: Changes the numbers of channels of the specified network device | |
571 | and `combined`: Changes the number of multi-purpose channels. | |
572 | ||
- VM Configuration with libvirt

  * Change the user/group, access control policy and restart libvirtd.

    - In `/etc/libvirt/qemu.conf` add/edit the following lines:

      ```
      user = "root"
      group = "root"
      ```

    - Disable SELinux or set it to permissive mode

      `setenforce 0`

    - Restart the libvirtd process. For example, on Fedora:

      `systemctl restart libvirtd.service`

  * Instantiate the VM

    - Copy the xml configuration from [Guest VM using libvirt] into the
      workspace.

    - Start the VM.

      `virsh create demovm.xml`

    - Connect to the guest console

      `virsh console demovm`

  * VM configuration

    The demovm xml configuration is aimed at achieving out-of-the-box
    performance on the VM.

    - The vCPUs are pinned to the cores of CPU socket 0 using vcpupin.

    - Configure the NUMA cell and shared memory using memAccess='shared'.

    - Disable mergeable rx buffers with mrg_rxbuf='off'.

    Note: For information on libvirt and further tuning, refer to [libvirt].

### 6.3 DPDK backend inside VM

Please note that additional configuration is required if you want to run
ovs-vswitchd with the DPDK backend inside a QEMU virtual machine. ovs-vswitchd
creates separate DPDK tx queues for each CPU core available. This operation
fails inside a QEMU virtual machine because, by default, the VirtIO NIC
provided to the guest is configured to support only a single tx queue and a
single rx queue. To change this behavior, you need to turn on the 'mq'
(multiqueue) property of all virtio-net-pci devices emulated by QEMU and used
by DPDK. You may do it manually (by changing the QEMU command line) or, if
you use libvirt, by adding the following string:

`<driver name='vhost' queues='N'/>`

to the <interface> sections of all network devices used by DPDK. The
parameter 'N' determines how many queues can be used by the guest. This may
not work with old versions of QEMU found in some distros; it requires QEMU
version >= 2.2.

## <a name="qos"></a> 7. QOS

Here is an example of QoS usage.
Assuming you have a vhost-user port transmitting traffic consisting of
packets of size 64 bytes, the following command would limit the egress
transmission rate of the port to ~1,000,000 packets per second:

`ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos create qos
type=egress-policer other-config:cir=46000000 other-config:cbs=2048`

To examine the QoS configuration of the port:

`ovs-appctl -t ovs-vswitchd qos/show vhost-user0`

To clear the QoS configuration from the port and the ovsdb, use the
following:

`ovs-vsctl destroy QoS vhost-user0 -- clear Port vhost-user0 qos`

For more details regarding egress-policer parameters please refer to
vswitch.xml.

## <a name="rl"></a> 8. Rate Limiting

Here is an example of ingress policing usage.
Assuming you have a vhost-user port receiving traffic consisting of
packets of size 64 bytes, the following command would limit the reception
rate of the port to ~1,000,000 packets per second:

`ovs-vsctl set interface vhost-user0 ingress_policing_rate=368000
ingress_policing_burst=1000`

To examine the ingress policer configuration of the port:

`ovs-vsctl list interface vhost-user0`

To clear the ingress policer configuration from the port, use the following:

`ovs-vsctl set interface vhost-user0 ingress_policing_rate=0`

For more details regarding the ingress policer see vswitch.xml.

## <a name="fc"></a> 9. Flow Control

Flow control can be enabled only on DPDK physical ports. To enable flow
control support on the transmit (tx) side while adding a port, add the
'tx-flow-ctrl' option to the 'ovs-vsctl add-port' command, as in the
example below:

```
ovs-vsctl add-port br0 dpdk0 -- \
set Interface dpdk0 type=dpdk options:tx-flow-ctrl=true
```

Similarly, to enable flow control on the receive (rx) side:

```
ovs-vsctl add-port br0 dpdk0 -- \
set Interface dpdk0 type=dpdk options:rx-flow-ctrl=true
```

And to enable flow control auto-negotiation:

```
ovs-vsctl add-port br0 dpdk0 -- \
set Interface dpdk0 type=dpdk options:flow-ctrl-autoneg=true
```

To turn on tx flow control at run time (after the port has been added to
OVS), use:

`ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=true`

Each flow control parameter can be turned off by setting it to 'false'.
For example, to disable flow control on the tx side:

`ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false`

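The three options can also be combined in a single `add-port` invocation.
The following is a sketch using the same bridge and port names as the
examples above:

```
ovs-vsctl add-port br0 dpdk0 -- \
set Interface dpdk0 type=dpdk \
options:tx-flow-ctrl=true options:rx-flow-ctrl=true \
options:flow-ctrl-autoneg=true
```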
## <a name="pdump"></a> 10. Pdump

Pdump allows you to listen on DPDK ports and view the traffic that is
passing on them. To use this utility, libpcap must be installed on the
system. Furthermore, DPDK must be built with `CONFIG_RTE_LIBRTE_PDUMP=y`
and `CONFIG_RTE_LIBRTE_PMD_PCAP=y`.

To use pdump, simply launch OVS as usual. Then, navigate to the 'app/pdump'
directory in DPDK, 'make' the application and run it like so:

```
sudo ./build/app/dpdk_pdump -- \
--pdump port=0,queue=0,rx-dev=/tmp/pkts.pcap \
--server-socket-path=/usr/local/var/run/openvswitch
```

The above command captures traffic received on queue 0 of port 0 and stores
it in /tmp/pkts.pcap. Other combinations of port numbers, queue numbers and
pcap locations are, of course, also available. 'server-socket-path' must be
set to the value of ovs_rundir(), which typically resolves to
'/usr/local/var/run/openvswitch'. More information on the pdump app and its
usage can be found at the link below:

http://dpdk.org/doc/guides/sample_app_ug/pdump.html

Many tools are available to view the contents of the pcap file. One example
is tcpdump. Issue the following command to view the contents of 'pkts.pcap':

`tcpdump -r pkts.pcap`

A performance decrease is expected when using a monitoring application like
the DPDK pdump app.

## <a name="jumbo"></a> 11. Jumbo Frames

By default, DPDK ports are configured with a standard Ethernet MTU (1500B).
To enable Jumbo Frames support for a DPDK port, change the Interface's
`mtu_request` attribute to a sufficiently large value.

e.g. Add a DPDK phy port with an MTU of 9000:

`ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0 mtu_request=9000`

e.g. Change the MTU of an existing port to 6200:

`ovs-vsctl set Interface dpdk0 mtu_request=6200`

When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments is
increased, such that a full Jumbo Frame of a specific size may be
accommodated within a single mbuf segment.

Jumbo frame support has been validated against 9728B frames (the largest
frame size supported by the Fortville NIC) using the DPDK `i40e` driver,
but larger frames (particularly in use cases involving East-West traffic
only) and other DPDK NIC drivers may also be supported.

### 11.1 vHost Ports and Jumbo Frames

Some additional configuration is needed to take advantage of jumbo frames
with vHost ports:

1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
   the QEMU command line snippet below:

   ```
   '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
   '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
   ```

2. Where virtio devices are bound to the Linux kernel driver in a guest
   environment (i.e. interfaces are not bound to an in-guest DPDK driver),
   the MTU of those logical network interfaces must also be increased to a
   sufficiently large value. This avoids segmentation of Jumbo Frames
   received in the guest. Note that 'MTU' refers to the length of the IP
   packet only, and not that of the entire frame.

   To calculate the exact MTU of a standard IPv4 frame, subtract the L2
   header and CRC lengths (i.e. 18B) from the max supported frame size.
   So, to set the MTU for a 9018B Jumbo Frame:

   ```
   ifconfig eth1 mtu 9000
   ```

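The subtraction above can be expressed as a quick shell calculation, using
the values from the example (the 18B figure is the 14B Ethernet header plus
the 4B CRC):

```
# Derive the in-guest MTU from the max supported frame size.
frame_size=9018
mtu=$((frame_size - 14 - 4))    # subtract Ethernet header (14B) + CRC (4B)
echo "$mtu"                     # prints 9000
```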
## <a name="vsperf"></a> 12. Vsperf

The goal of the Vsperf project is to develop a vSwitch test framework that
can be used to validate the suitability of different vSwitch implementations
in a Telco deployment environment. More information can be found at the
link below:

https://wiki.opnfv.org/display/vsperf/VSperf+Home


Bug Reporting:
--------------

Please report problems to bugs@openvswitch.org.


[INSTALL.userspace.md]:INSTALL.userspace.md
[INSTALL.md]:INSTALL.md
[DPDK Linux GSG]: http://www.dpdk.org/doc/guides/linux_gsg/build_dpdk.html#binding-and-unbinding-network-ports-to-from-the-igb-uioor-vfio-modules
[DPDK Docs]: http://dpdk.org/doc
[libvirt]: http://libvirt.org/formatdomain.html
[Guest VM using libvirt]: INSTALL.DPDK.md#ovstc
[Vhost walkthrough]: INSTALL.DPDK.md#vhost
[INSTALL DPDK]: INSTALL.DPDK.md#build
[INSTALL OVS]: INSTALL.DPDK.md#build