OVS DPDK ADVANCED INSTALL GUIDE
===============================

## Contents

1. [Overview](#overview)
2. [Building Shared Library](#build)
3. [System Configuration](#sysconf)
4. [Performance Tuning](#perftune)
5. [OVS Testcases](#ovstc)
6. [Vhost Walkthrough](#vhost)
7. [QOS](#qos)
8. [Rate Limiting](#rl)
9. [Flow Control](#fc)
10. [Pdump](#pdump)
11. [Jumbo Frames](#jumbo)
12. [Vsperf](#vsperf)

## <a name="overview"></a> 1. Overview

The Advanced Install Guide explains how to improve OVS performance when using
the DPDK datapath. This guide also provides information on tuning, system
configuration, troubleshooting, static code analysis and testcases.

## <a name="build"></a> 2. Building Shared Library

DPDK can be built as a static or a shared library and is linked by applications
that use the DPDK datapath. This section lists the steps to build DPDK as a
shared library and dynamically link it against OVS.

Note: A minor performance loss is seen with OVS when using the shared DPDK
library as compared to the static library.

Check sections [INSTALL DPDK] and [INSTALL OVS] of INSTALL.DPDK for download
instructions for DPDK and OVS.

* Configure the DPDK library

  Set `CONFIG_RTE_BUILD_SHARED_LIB=y` in `config/common_base`
  to generate a shared DPDK library.

* Build and install DPDK

  For the default install (without IVSHMEM), set `export DPDK_TARGET=x86_64-native-linuxapp-gcc`.
  For the IVSHMEM case, set `export DPDK_TARGET=x86_64-ivshmem-linuxapp-gcc`.

  ```
  export DPDK_DIR=/usr/src/dpdk-16.07
  export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
  make install T=$DPDK_TARGET DESTDIR=install
  ```

* Build, install and set up OVS

  Export the DPDK shared library location and set up OVS as listed in
  section 3.3 of INSTALL.DPDK.

  `export LD_LIBRARY_PATH=$DPDK_DIR/x86_64-native-linuxapp-gcc/lib`
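
  For reference, a minimal sketch of the subsequent OVS build step against the
  shared DPDK library (see INSTALL.DPDK for the full, authoritative steps; the
  paths reuse the variables exported above):

  ```
  cd $OVS_DIR            # OVS source tree (variable name illustrative)
  ./boot.sh
  ./configure --with-dpdk=$DPDK_BUILD
  make && make install
  ```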

## <a name="sysconf"></a> 3. System Configuration

To achieve optimal OVS performance, the system should be configured
appropriately: this includes BIOS tweaks, GRUB cmdline additions, an
understanding of the NUMA topology, and careful selection of PCIe slots for
NIC placement.

### 3.1 Recommended BIOS settings

```
| Settings                  | values      | comments
|---------------------------|-------------|-----------
| C3 power state            | Disabled    | -
| C6 power state            | Disabled    | -
| MLC Streamer              | Enabled     | -
| MLC Spatial prefetcher    | Enabled     | -
| DCU Data prefetcher       | Enabled     | -
| DCA                       | Enabled     | -
| CPU power and performance | Performance | -
| Memory RAS and perf       |             | -
|   config-> NUMA optimized | Enabled     | -
```

### 3.2 PCIe Slot Selection

The fastpath performance also depends on factors like NIC placement,
channel speeds between the PCIe slot and the CPU, and the proximity of the
PCIe slot to the CPU cores running the DPDK application. Listed below are the
steps to identify the right PCIe slot; a verification sketch follows the list.

- Retrieve host details using cmd `dmidecode -t baseboard | grep "Product Name"`
- Download the technical specification for the product listed, e.g. S2600WT2.
- Check the Product Architecture Overview for riser slot placement,
  CPU sharing info and PCIe channel speeds.

  Example: on S2600WT, CPU1 and CPU2 share Riser Slot 1, with the channel speed
  between CPU1 and Riser Slot 1 at 32GB/s and between CPU2 and Riser Slot 1 at
  16GB/s. Running the DPDK app on CPU1 cores with the NIC inserted into the
  riser card slots will optimize OVS performance in this case.

- Check the Riser Card #1 - Root Port mapping information for the available
  slots and individual bus speeds. On S2600WT, slot 1 and slot 2 have high bus
  speeds and are potential slots for NIC placement.
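
To cross-check a candidate slot from the OS, the PCIe link speed/width and the
NUMA node of an installed NIC can be read with standard tools; a sketch (the
PCI address 05:00.0 is illustrative only):

```
lspci | grep -i ethernet                          # find the NIC's PCI address
lspci -s 05:00.0 -vv | grep -E "LnkCap|LnkSta"    # link capability vs negotiated state
cat /sys/bus/pci/devices/0000:05:00.0/numa_node   # NUMA node the slot hangs off
```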

### 3.3 Advanced Hugepage setup

Allocate and mount 1G huge pages (a verification sketch follows the list):

- For persistent allocation of huge pages, add the following options to the
  kernel bootline:

  `default_hugepagesz=1G hugepagesz=1G hugepages=N`

  For platforms supporting multiple huge page sizes, add the options

  `default_hugepagesz=<size> hugepagesz=<size> hugepages=N`

  where 'N' = number of huge pages requested, 'size' = huge page size with an
  optional suffix [kKmMgG].

- For run-time allocation of huge pages:

  `echo N > /sys/devices/system/node/nodeX/hugepages/hugepages-1048576kB/nr_hugepages`

  where 'N' = number of huge pages requested, 'X' = NUMA node.

  Note: For run-time allocation of 1G huge pages, the Contiguous Memory
  Allocator (CONFIG_CMA) has to be supported by the kernel; check your Linux
  distro.

- Mount huge pages:

  `mount -t hugetlbfs -o pagesize=1G none /dev/hugepages`

  Note: Mount hugepages only if not already mounted by default.
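
After booting (or after the run-time echo), the allocation and the mount can be
verified with standard commands; a quick sketch:

```
grep Huge /proc/meminfo                   # overall hugepage counters
cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
mount | grep hugetlbfs                    # confirm the hugetlbfs mount point
```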

### 3.4 Enable Hyperthreading

Requires BIOS changes.

With HT/SMT enabled, a physical core appears as two logical cores.
SMT can be utilized to spawn worker threads on logical cores of the same
physical core, thereby saving additional cores.

With DPDK, when pinning pmd threads to logical cores, care must be taken
to set the correct bits in the pmd-cpu-mask to ensure that the pmd threads are
pinned to SMT siblings.

Example system configuration:
Dual socket machine, 2x 10 core processors, HT enabled, 40 logical cores.

To use two logical cores which share the same physical core for pmd threads,
the following command can be used to identify a pair of logical cores.

`cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list`, where N is the
logical core number.

In this example, it would show that cores 1 and 21 share the same physical
core. The pmd-cpu-mask to enable two pmd threads running on these two logical
cores (one physical core) is:

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=200002`
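
As a sanity check, the mask can be computed directly from the chosen logical
core numbers; a minimal shell sketch (assuming bash arithmetic):

```
# set bit 1 and bit 21 of the mask -> prints 200002
printf '%x\n' $(( (1 << 1) | (1 << 21) ))
```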

### 3.5 Isolate cores

The 'isolcpus' option can be used to isolate cores from the Linux scheduler.
The isolated cores can then be dedicated to running HPC applications/threads.
This helps achieve better application performance due to zero context switching
and minimal cache thrashing. To run platform logic on core 0 and isolate cores
1 to 19 from the scheduler, add `isolcpus=1-19` to the GRUB cmdline.

Note: In some circumstances core isolation has been found to offer only a
minimal advantage, owing to the mature Linux scheduler.

### 3.6 NUMA/Cluster on Die

Ideally inter-NUMA datapaths should be avoided where possible, as packets
will go across QPI and there may be a slight performance penalty when
compared with intra-NUMA datapaths. On the Intel Xeon Processor E5 v3,
Cluster On Die is introduced on models that have 10 cores or more.
This makes it possible to logically split a socket into two NUMA regions,
and again it is preferred where possible to keep critical datapaths
within one cluster.

It is good practice to ensure that threads that are in the datapath are
pinned to cores in the same NUMA area, e.g. pmd threads and QEMU vCPUs
responsible for forwarding. If DPDK is built with
CONFIG_RTE_LIBRTE_VHOST_NUMA=y, vHost User ports automatically
detect the NUMA socket of the QEMU vCPUs and will be serviced by a PMD
from the same node provided a core on this node is enabled in the
pmd-cpu-mask. libnuma packages are required for this feature.
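
For example, on a two-socket host where cores 0-9 sit on NUMA node 0 and cores
10-19 on NUMA node 1 (a hypothetical layout - check `lscpu` for the real one),
enabling one core from each node in the mask lets vhost ports backed by either
node be serviced by a local PMD; a sketch:

```
# bit 2 (node 0) and bit 12 (node 1) -> mask 0x1004
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=1004
```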

### 3.7 Compiler Optimizations

The default compiler optimization level is '-O2'. Changing this to a
more aggressive compiler optimization such as '-O3 -march=native'
with gcc (verified on 5.3.1) can produce performance gains, though not
significant ones. '-march=native' will produce code optimized for the local
machine and should only be used when the software is compiled on the testbed
itself. A sketch of passing the flags is shown below.
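
One way to apply the flags is through the standard autoconf CFLAGS variable
when configuring OVS; a minimal sketch (assuming the DPDK variables from
section 2):

```
./configure --with-dpdk=$DPDK_BUILD CFLAGS="-O3 -march=native"
make && make install
```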

## <a name="perftune"></a> 4. Performance Tuning

### 4.1 Affinity

For superior performance, DPDK pmd threads and QEMU vCPU threads
need to be affinitized accordingly.

* PMD thread Affinity

  A poll mode driver (pmd) thread handles the I/O of all DPDK
  interfaces assigned to it. A pmd thread shall poll the ports
  for incoming packets, switch the packets and send them to the tx port.
  A pmd thread is CPU bound, and needs to be affinitized to isolated
  cores for optimum performance.

  By setting a bit in the mask, a pmd thread is created and pinned
  to the corresponding CPU core. e.g. to run a pmd thread on core 2:

  `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=4`

  Note: A pmd thread on a NUMA node is only created if there is
  at least one DPDK interface from that NUMA node added to OVS.

* QEMU vCPU thread Affinity

  A VM performing simple packet forwarding or running complex packet
  pipelines has to ensure that the vCPU threads performing the work have
  as much CPU occupancy as possible.

  Example: On a multicore VM, multiple QEMU vCPU threads shall be spawned.
  When the DPDK 'testpmd' application that does packet forwarding
  is invoked, the 'taskset' cmd should be used to affinitize the vCPU threads
  to the dedicated isolated cores on the host system, as sketched below.
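
  A sketch of pinning the vCPU threads from the host side (thread IDs and core
  numbers are illustrative - substitute the real values reported by ps):

  ```
  # list QEMU thread IDs, then pin each vCPU thread to an isolated host core
  ps -eLo pid,tid,comm | grep -E "qemu|KVM"
  taskset -pc 4 <vcpu-tid-1>
  taskset -pc 5 <vcpu-tid-2>
  ```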

### 4.2 Multiple poll mode driver threads

With pmd multi-threading support, OVS creates one pmd thread
for each NUMA node by default. However, in cases where there are
multiple ports/rxq's producing traffic, performance can be improved
by creating multiple pmd threads running on separate cores. These pmd
threads can then share the workload by each being responsible for
different ports/rxq's. Assignment of ports/rxq's to pmd threads is
done automatically.

A set bit in the mask means a pmd thread is created and pinned
to the corresponding CPU core. e.g. to run pmd threads on cores 1 and 2:

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6`

For example, when using dpdk and dpdkvhostuser ports in a bi-directional
VM loopback as shown below, spreading the workload over 2 or 4 pmd
threads shows significant improvements as there will be more total CPU
occupancy available.

    NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1

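If your OVS build provides them, the following ovs-appctl commands are a handy
way to confirm how the rxq's were distributed and how busy each pmd thread is
(a sketch; output format varies between OVS versions):

```
ovs-appctl dpif-netdev/pmd-rxq-show
ovs-appctl dpif-netdev/pmd-stats-show
```
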
### 4.3 DPDK physical port Rx Queues

`ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>`

The command above sets the number of rx queues for a DPDK physical interface.
The rx queues are assigned to pmd threads on the same NUMA node in a
round-robin fashion.

### 4.4 DPDK Physical Port Queue Sizes

`ovs-vsctl set Interface dpdk0 options:n_rxq_desc=<integer>`
`ovs-vsctl set Interface dpdk0 options:n_txq_desc=<integer>`

The commands above set the number of rx/tx descriptors that the NIC
associated with dpdk0 will be initialised with.

Different 'n_rxq_desc' and 'n_txq_desc' configurations yield different
benefits in terms of throughput and latency for different scenarios.
Generally, smaller queue sizes can have a positive impact on latency at the
expense of throughput. The opposite is often true for larger queue sizes.
Note: increasing the number of rx descriptors, e.g. to 4096, may have a
negative impact on performance due to the fact that non-vectorised DPDK rx
functions may be used. This is dependent on the driver in use, but is true
for the commonly used i40e and ixgbe DPDK drivers.

### 4.5 Exact Match Cache

Each pmd thread contains one EMC. After initial flow setup in the
datapath, the EMC contains a single table and provides the lowest level
(fastest) switching for DPDK ports. If there is a miss in the EMC then
the next level where switching will occur is the datapath classifier.
Missing in the EMC and looking up in the datapath classifier incurs a
significant performance penalty. If lookup misses occur in the EMC
because it is too small to handle the number of flows, its size can
be increased. The EMC size can be modified by editing the define
EM_FLOW_HASH_SHIFT in lib/dpif-netdev.c.

As mentioned above an EMC is per pmd thread. So an alternative way of
increasing the aggregate amount of possible flow entries in the EMC and
avoiding datapath classifier lookups is to have multiple pmd threads
running. This can be done as described in section 4.2.
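
The define can be located in the source tree before rebuilding OVS; since the
entry count is derived from this shift, incrementing it by one roughly doubles
the EMC (a sketch - the default value differs between OVS versions):

```
# locate the EMC size define, edit it, then rebuild OVS
grep -n "EM_FLOW_HASH_SHIFT" lib/dpif-netdev.c
```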

### 4.6 Rx Mergeable buffers

Rx mergeable buffers is a virtio feature that allows chaining of multiple
virtio descriptors to handle large packet sizes. As such, large packets
are handled by reserving and chaining multiple free descriptors
together. Mergeable buffer support is negotiated between the virtio
driver and virtio device and is supported by the DPDK vhost library.
This behavior is typically supported and enabled by default. However,
in the case where the user knows that rx mergeable buffers are not needed,
i.e. jumbo frames are not needed, it can be forced off by adding
mrg_rxbuf=off to the QEMU command line options. By not reserving multiple
chains of descriptors, more individual virtio descriptors are made
available for rx to the guest using dpdkvhost ports, and this can improve
performance.
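
For example, relative to the vhost-user device arguments shown in section 6.1,
only the '-device' line changes (a sketch):

```
-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=off
```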

## <a name="ovstc"></a> 5. OVS Testcases

### 5.1 PHY-VM-PHY [VHOST LOOPBACK]

Section 5.2 of the INSTALL.DPDK guide lists the steps for the PVP loopback
testcase and packet forwarding using the DPDK testpmd application in the guest
VM. For users wanting to do packet forwarding using the kernel stack, the steps
are below.

```
ifconfig eth1 1.1.1.2/24
ifconfig eth2 1.1.2.2/24
systemctl stop firewalld.service
systemctl stop iptables.service
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.eth1.rp_filter=0
sysctl -w net.ipv4.conf.eth2.rp_filter=0
route add -net 1.1.2.0/24 eth2
route add -net 1.1.1.0/24 eth1
arp -s 1.1.2.99 DE:AD:BE:EF:CA:FE
arp -s 1.1.1.99 DE:AD:BE:EF:CA:EE
```

### 5.2 PHY-VM-PHY [IVSHMEM]

The steps (1-5) in section 3.3 of the INSTALL.DPDK guide will create &
initialize the DB, start vswitchd and add dpdk devices to bridge br0.

1. Add a DPDK ring port to the bridge

   ```
   ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr
   ```

2. Build modified QEMU (qemu-2.2.1 + ivshmem-qemu-2.2.1.patch)

   ```
   cd /usr/src/
   wget http://wiki.qemu.org/download/qemu-2.2.1.tar.bz2
   tar -jxvf qemu-2.2.1.tar.bz2
   cd /usr/src/qemu-2.2.1
   wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/patches/ivshmem-qemu-2.2.1.patch
   patch -p1 < ivshmem-qemu-2.2.1.patch
   ./configure --target-list=x86_64-softmmu --enable-debug --extra-cflags='-g'
   make -j 4
   ```

3. Generate the QEMU command line

   ```
   mkdir -p /usr/src/cmdline_generator
   cd /usr/src/cmdline_generator
   wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/cmdline_generator.c
   wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/Makefile
   export RTE_SDK=/usr/src/dpdk-16.07
   export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc
   make
   ./build/cmdline_generator -m -p dpdkr0 XXX
   cmdline=`cat OVSMEMPOOL`
   ```

4. Start the guest VM

   ```
   export VM_NAME=ivshmem-vm
   export QCOW2_IMAGE=/root/CentOS7_x86_64.qcow2
   export QEMU_BIN=/usr/src/qemu-2.2.1/x86_64-softmmu/qemu-system-x86_64

   taskset 0x20 $QEMU_BIN -cpu host -smp 2,cores=2 -hda $QCOW2_IMAGE -m 4096 --enable-kvm -name $VM_NAME -nographic -vnc :2 -pidfile /tmp/vm1.pid $cmdline
   ```

5. Run the sample "dpdk ring" app in the VM

   ```
   echo 1024 > /proc/sys/vm/nr_hugepages
   mount -t hugetlbfs nodev /dev/hugepages (if not already mounted)

   # Build the DPDK ring application in the VM
   export RTE_SDK=/root/dpdk-16.07
   export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc
   make

   # Run the dpdkring application
   ./build/dpdkr -c 1 -n 4 -- -n 0
   where "-n 0" refers to ring '0' i.e. dpdkr0
   ```

### 5.3 PHY-VM-PHY [VHOST MULTIQUEUE]

The steps (1-5) in section 3.3 of the [INSTALL DPDK] guide will create &
initialize the DB, start vswitchd and add dpdk devices to bridge br0.

1. Configure the PMDs and RXQs. For example, set the number of dpdk port rx
   queues to at least 2. The number of rx queues at the vhost-user interface
   gets automatically configured after virtio device connection and doesn't
   need manual configuration.

   ```
   ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=c
   ovs-vsctl set Interface dpdk0 options:n_rxq=2
   ovs-vsctl set Interface dpdk1 options:n_rxq=2
   ```

2. Instantiate the guest VM using the QEMU cmdline

   Guest Configuration

   ```
   | configuration        | values  | comments
   |----------------------|---------|-----------------
   | qemu version         | 2.5.0   | -
   | qemu thread affinity | 2 cores | taskset 0x30
   | memory               | 4GB     | -
   | cores                | 2       | -
   | Qcow2 image          | Fedora22| -
   | multiqueue           | on      | -
   ```

   Instantiate the guest

   ```
   export VM_NAME=vhost-vm
   export GUEST_MEM=4096M
   export QCOW2_IMAGE=/root/Fedora22_x86_64.qcow2
   export VHOST_SOCK_DIR=/usr/local/var/run/openvswitch

   taskset 0x30 qemu-system-x86_64 -cpu host -smp 2,cores=2 -drive file=$QCOW2_IMAGE -m 4096M --enable-kvm -name $VM_NAME -nographic -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on -numa node,memdev=mem -mem-prealloc -chardev socket,id=char1,path=$VHOST_SOCK_DIR/dpdkvhostuser0 -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=2 -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mq=on,vectors=6 -chardev socket,id=char2,path=$VHOST_SOCK_DIR/dpdkvhostuser1 -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=2 -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=6
   ```

   Note: The queue value above should match the queues configured in OVS. The
   vector value should be set to 'no. of queues x 2 + 2'.

3. Guest interface configuration

   Assuming there are 2 interfaces in the guest named eth0, eth1, check the
   channel configuration and set the number of combined channels to 2 for the
   virtio devices. More information can be found in the [Vhost walkthrough]
   section.

   ```
   ethtool -l eth0
   ethtool -L eth0 combined 2
   ethtool -L eth1 combined 2
   ```

4. Kernel packet forwarding

   Configure IP addresses and enable the interfaces

   ```
   ifconfig eth0 5.5.5.1/24 up
   ifconfig eth1 90.90.90.1/24 up
   ```

   Configure IP forwarding and add route entries

   ```
   sysctl -w net.ipv4.ip_forward=1
   sysctl -w net.ipv4.conf.all.rp_filter=0
   sysctl -w net.ipv4.conf.eth0.rp_filter=0
   sysctl -w net.ipv4.conf.eth1.rp_filter=0
   ip route add 2.1.1.0/24 dev eth1
   route add default gw 2.1.1.2 eth1
   route add default gw 90.90.90.90 eth1
   arp -s 90.90.90.90 DE:AD:BE:EF:CA:FE
   arp -s 2.1.1.2 DE:AD:BE:EF:CA:FA
   ```

   Check traffic on multiple queues

   ```
   cat /proc/interrupts | grep virtio
   ```

## <a name="vhost"></a> 6. Vhost Walkthrough

Two types of vHost User ports are available in OVS:

1. vhost-user (dpdkvhostuser ports)

2. vhost-user-client (dpdkvhostuserclient ports)

vHost User uses a client-server model. The server creates/manages/destroys the
vHost User sockets, and the client connects to the server. Depending on which
port type you use, dpdkvhostuser or dpdkvhostuserclient, a different
configuration of the client-server model is used.

For vhost-user ports, OVS DPDK acts as the server and QEMU the client.
For vhost-user-client ports, OVS DPDK acts as the client and QEMU the server.

### 6.1 vhost-user

- Prerequisites:

  QEMU version >= 2.2

- Adding vhost-user ports to the switch

  Unlike DPDK ring ports, DPDK vhost-user ports can have arbitrary names,
  except that forward and backward slashes are prohibited in the names.

  For vhost-user, the name of the port type is `dpdkvhostuser`

  ```
  ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1
  type=dpdkvhostuser
  ```

  This action creates a socket located at
  `/usr/local/var/run/openvswitch/vhost-user-1`, which you must provide
  to your VM on the QEMU command line. More instructions on this can be
  found in the next section, "Adding vhost-user ports to VM".

  Note: If you wish for the vhost-user sockets to be created in a
  sub-directory of `/usr/local/var/run/openvswitch`, you may specify
  this directory in the ovsdb like so:

  `./utilities/ovs-vsctl --no-wait \
    set Open_vSwitch . other_config:vhost-sock-dir=subdir`

- Adding vhost-user ports to VM

  1. Configure sockets

     Pass the following parameters to QEMU to attach a vhost-user device:

     ```
     -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1
     -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
     -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1
     ```

     where vhost-user-1 is the name of the vhost-user port added
     to the switch.
     Repeat the above parameters for multiple devices, changing the
     chardev path and id as necessary. Note that a separate and different
     chardev path needs to be specified for each vhost-user device. For
     example, if you have a second vhost-user port named 'vhost-user-2', you
     append your QEMU command line with an additional set of parameters:

     ```
     -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
     -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce
     -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2
     ```

  2. Configure huge pages

     QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports access
     a virtio-net device's virtual rings and packet buffers by mapping the VM's
     physical memory on hugetlbfs. To enable vhost-user ports to map the VM's
     memory into their process address space, pass the following parameters
     to QEMU:

     ```
     -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,
     share=on -numa node,memdev=mem -mem-prealloc
     ```

  3. Enable multiqueue support (OPTIONAL)

     QEMU needs to be configured to use multiqueue.
     The $q below is the number of queues.
     The $v is the number of vectors, which is '$q x 2 + 2'.

     ```
     -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
     -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q
     -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v
     ```

     The vhost-user interface will be automatically reconfigured with the
     required number of rx and tx queues after connection of the virtio device.
     Manual configuration of `n_rxq` is not supported because OVS will work
     properly only if `n_rxq` matches the number of queues configured in QEMU.

     At least 2 PMDs should be configured for the vswitch when using
     multiqueue. Using a single PMD will cause traffic to be enqueued to the
     same vhost queue rather than being distributed among different vhost
     queues for a vhost-user interface.

     If traffic destined for a VM configured with multiqueue arrives to the
     vswitch via a physical DPDK port, then the number of rxqs should also be
     set to at least 2 for that physical DPDK port. This is required to
     increase the probability that a different PMD will handle the multiqueue
     transmission to the guest using a different vhost queue.

     If one wishes to use multiple queues for an interface in the guest, the
     driver in the guest operating system must be configured to do so. It is
     recommended that the number of queues configured be equal to '$q'.

     For example, this can be done for the Linux kernel virtio-net driver with:

     ```
     ethtool -L <DEV> combined <$q>
     ```

     where `-L`: Changes the numbers of channels of the specified network
     device, and `combined`: Changes the number of multi-purpose channels.

- VM Configuration with libvirt

  * Change the user/group, access control policy and restart libvirtd.

    - In `/etc/libvirt/qemu.conf` add/edit the following lines

      ```
      user = "root"
      group = "root"
      ```

    - Disable SELinux or set it to permissive mode

      `setenforce 0`

    - Restart the libvirtd process. For example, on Fedora:

      `systemctl restart libvirtd.service`

  * Instantiate the VM

    - Copy the xml configuration from [Guest VM using libvirt] into the workspace.

    - Start the VM.

      `virsh create demovm.xml`

    - Connect to the guest console

      `virsh console demovm`

  * VM configuration

    The demovm xml configuration is aimed at achieving out-of-the-box
    performance on the VM.

    - The vcpus are pinned to the cores of CPU socket 0 using vcpupin.

    - Configure the NUMA cell and shared memory using memAccess='shared'.

    - Disable mergeable buffers using mrg_rxbuf='off'.

    Note: For information on libvirt and further tuning, refer to [libvirt].

### 6.2 vhost-user-client

- Prerequisites:

  QEMU version >= 2.7

- Adding vhost-user-client ports to the switch

  ```
  ovs-vsctl add-port br0 vhost-client-1 -- set Interface vhost-client-1
  type=dpdkvhostuserclient options:vhost-server-path=/path/to/socket
  ```

  Unlike vhost-user ports, the name given to the port does not govern the name
  of the socket device. 'vhost-server-path' reflects the full path of the
  socket that has been or will be created by QEMU for the given vHost User
  client port.

- Adding vhost-user-client ports to VM

  The same QEMU parameters as for vhost-user ports described in section 6.1 can
  be used, with one change necessary. One must append ',server' to the
  'chardev' arguments on the QEMU command line, to instruct QEMU to use vHost
  server mode for a given interface, like so:

  ```
  -chardev socket,id=char0,path=/path/to/socket,server
  ```

  If the corresponding dpdkvhostuserclient port has not yet been configured
  in OVS with vhost-server-path=/path/to/socket, QEMU will print a log
  similar to the following:

  `QEMU waiting for connection on: disconnected:unix:/path/to/socket,server`

  QEMU will wait until the port is created successfully in OVS to boot the VM.

  One benefit of using this mode is the ability for vHost ports to
  'reconnect' in the event of the switch crashing or being brought down. Once
  it is brought back up, the vHost ports will reconnect automatically and
  normal service will resume.

### 6.3 DPDK backend inside VM

Please note that additional configuration is required if you want to run
ovs-vswitchd with the DPDK backend inside a QEMU virtual machine. ovs-vswitchd
creates separate DPDK TX queues for each CPU core available. This operation
fails inside a QEMU virtual machine because, by default, the VirtIO NIC
provided to the guest is configured to support only a single TX queue and a
single RX queue. To change this behavior, you need to turn on the 'mq'
(multiqueue) property of all virtio-net-pci devices emulated by QEMU and used
by DPDK. You may do it manually (by changing the QEMU command line) or, if you
use libvirt, by adding the following string:

`<driver name='vhost' queues='N'/>`

to the `<interface>` sections of all network devices used by DPDK. Parameter
'N' determines how many queues can be used by the guest. This may not work
with old versions of QEMU found in some distros and requires QEMU version
>= 2.2.
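
For illustration, a sketch of a complete libvirt `<interface>` element for a
vhost-user backed device with multiqueue enabled (socket path and queue count
are illustrative only):

```
<interface type='vhostuser'>
  <mac address='00:00:00:00:00:01'/>
  <source type='unix' path='/usr/local/var/run/openvswitch/dpdkvhostuser0' mode='client'/>
  <model type='virtio'/>
  <driver name='vhost' queues='4'/>
</interface>
```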

## <a name="qos"></a> 7. QOS

Here is an example of QOS usage.
Assuming you have a vhost-user port transmitting traffic consisting of
packets of size 64 bytes, the following command would limit the egress
transmission rate of the port to ~1,000,000 packets per second. (The cir value
works out as 1,000,000 packets/sec x 46 bytes, i.e. the 64B frame less the 14B
Ethernet header and 4B CRC.)

`ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos create qos
type=egress-policer other-config:cir=46000000 other-config:cbs=2048`

To examine the QoS configuration of the port:

`ovs-appctl -t ovs-vswitchd qos/show vhost-user0`

To clear the QoS configuration from the port and ovsdb, use the following:

`ovs-vsctl destroy QoS vhost-user0 -- clear Port vhost-user0 qos`

For more details regarding egress-policer parameters please refer to
vswitch.xml.

## <a name="rl"></a> 8. Rate Limiting

Here is an example of ingress policing usage.
Assuming you have a vhost-user port receiving traffic consisting of
packets of size 64 bytes, the following command would limit the reception
rate of the port to ~1,000,000 packets per second. (The rate is expressed in
kbps: 1,000,000 packets/sec x 46 bytes x 8 bits = 368,000 kbps.)

`ovs-vsctl set interface vhost-user0 ingress_policing_rate=368000
ingress_policing_burst=1000`

To examine the ingress policer configuration of the port:

`ovs-vsctl list interface vhost-user0`

To clear the ingress policer configuration from the port, use the following:

`ovs-vsctl set interface vhost-user0 ingress_policing_rate=0`

For more details regarding the ingress policer see vswitch.xml.

## <a name="fc"></a> 9. Flow Control

Flow control can be enabled only on DPDK physical ports.
To enable flow control support on the tx side while adding a port, add the
'tx-flow-ctrl' option to 'ovs-vsctl add-port' as in the example below.

```
ovs-vsctl add-port br0 dpdk0 -- \
set Interface dpdk0 type=dpdk options:tx-flow-ctrl=true
```

Similarly, to enable rx flow control:

```
ovs-vsctl add-port br0 dpdk0 -- \
set Interface dpdk0 type=dpdk options:rx-flow-ctrl=true
```

And to enable flow control auto-negotiation:

```
ovs-vsctl add-port br0 dpdk0 -- \
set Interface dpdk0 type=dpdk options:flow-ctrl-autoneg=true
```

To turn on tx flow control at run time (after the port has been added
to OVS), use:

`ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=true`

The flow control parameters can be turned off by setting 'false' for the
respective parameter. To disable flow control on the tx side:

`ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false`

## <a name="pdump"></a> 10. Pdump

Pdump allows you to listen on DPDK ports and view the traffic that is
passing on them. To use this utility, one must have libpcap installed
on the system. Furthermore, DPDK must be built with CONFIG_RTE_LIBRTE_PDUMP=y
and CONFIG_RTE_LIBRTE_PMD_PCAP=y.

To use pdump, simply launch OVS as usual. Then, navigate to the 'app/pdump'
directory in DPDK, 'make' the application and run it like so:

```
sudo ./build/app/dpdk-pdump --
--pdump port=0,queue=0,rx-dev=/tmp/pkts.pcap
--server-socket-path=/usr/local/var/run/openvswitch
```

The above command captures traffic received on queue 0 of port 0 and stores
it in /tmp/pkts.pcap. Other combinations of port numbers, queue numbers and
pcap locations are of course also available to use. For example, to capture
all packets that traverse port 0 in a single pcap file:

```
sudo ./build/app/dpdk-pdump --
--pdump 'port=0,queue=*,rx-dev=/tmp/pkts.pcap,tx-dev=/tmp/pkts.pcap'
--server-socket-path=/usr/local/var/run/openvswitch
```

'server-socket-path' must be set to the value of ovs_rundir() which typically
resolves to '/usr/local/var/run/openvswitch'.
More information on the pdump app and its usage can be found at the link below.

http://dpdk.org/doc/guides/sample_app_ug/pdump.html

Many tools are available to view the contents of the pcap file. One example is
tcpdump. Issue the following command to view the contents of 'pkts.pcap':

`tcpdump -r pkts.pcap`

A performance decrease is expected when using a monitoring application like
the DPDK pdump app.

## <a name="jumbo"></a> 11. Jumbo Frames

By default, DPDK ports are configured with the standard Ethernet MTU (1500B).
To enable Jumbo Frames support for a DPDK port, change the Interface's
`mtu_request` attribute to a sufficiently large value.

e.g. Add a DPDK Phy port with an MTU of 9000:

`ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0 mtu_request=9000`

e.g. Change the MTU of an existing port to 6200:

`ovs-vsctl set Interface dpdk0 mtu_request=6200`

When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments is
increased, such that a full Jumbo Frame of a specific size may be accommodated
within a single mbuf segment.

Jumbo frame support has been validated against 9728B frames (the largest frame
size supported by the Fortville NIC) using the DPDK `i40e` driver, but larger
frames (particularly in use cases involving East-West traffic only) and other
DPDK NIC drivers may also be supported.

### 11.1 vHost Ports and Jumbo Frames

Some additional configuration is needed to take advantage of jumbo frames with
vhost ports:

1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
   the QEMU command line snippet below:

   ```
   '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
   '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
   ```

2. Where virtio devices are bound to the Linux kernel driver in a guest
   environment (i.e. interfaces are not bound to an in-guest DPDK driver),
   the MTU of those logical network interfaces must also be increased to a
   sufficiently large value. This avoids segmentation of Jumbo Frames
   received in the guest. Note that 'MTU' refers to the length of the IP
   packet only, and not that of the entire frame.

   To calculate the exact MTU of a standard IPv4 frame, subtract the L2
   header and CRC lengths (i.e. 18B) from the max supported frame size.
   So, to set the MTU for a 9018B Jumbo Frame:

   ```
   ifconfig eth1 mtu 9000
   ```


## <a name="vsperf"></a> 12. Vsperf

The Vsperf project's goal is to develop a vSwitch test framework that can be
used to validate the suitability of different vSwitch implementations in a
Telco deployment environment. More information can be found at the link below.

https://wiki.opnfv.org/display/vsperf/VSperf+Home


Bug Reporting:
--------------

Please report problems to bugs@openvswitch.org.


[INSTALL.userspace.md]:INSTALL.userspace.md
[INSTALL.md]:INSTALL.md
[DPDK Linux GSG]: http://www.dpdk.org/doc/guides/linux_gsg/build_dpdk.html#binding-and-unbinding-network-ports-to-from-the-igb-uioor-vfio-modules
[DPDK Docs]: http://dpdk.org/doc
[libvirt]: http://libvirt.org/formatdomain.html
[Guest VM using libvirt]: INSTALL.DPDK.md#ovstc
[Vhost walkthrough]: INSTALL.DPDK.md#vhost
[INSTALL DPDK]: INSTALL.DPDK.md#build
[INSTALL OVS]: INSTALL.DPDK.md#build