..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      ======= Heading 0 (reserved for the title in a document)
      ------- Heading 1
      ~~~~~~~ Heading 2
      +++++++ Heading 3
      ''''''' Heading 4

      Avoid deeper levels because they do not render well.

=================================
Open vSwitch with DPDK (Advanced)
=================================

The Advanced Install Guide explains how to improve OVS performance when using
the DPDK datapath. This guide provides information on tuning, system
configuration, troubleshooting, static code analysis and testcases.

Building as a Shared Library
----------------------------

DPDK can be built as either a static or a shared library, which is then linked
by applications using the DPDK datapath. When building OVS with DPDK, you can
link Open vSwitch against the shared DPDK library.

.. note::
   Minor performance loss is seen with OVS when using the shared DPDK library
   as compared to the static library.

To build Open vSwitch using DPDK as a shared library, first refer to the `DPDK
installation guide`_ for download instructions for DPDK and OVS.

Once DPDK and OVS have been downloaded, you must configure the DPDK library
accordingly: simply set ``CONFIG_RTE_BUILD_SHARED_LIB=y`` in
``config/common_base``. Once done, DPDK can be built as usual. For example::

    $ export DPDK_TARGET=x86_64-native-linuxapp-gcc
    $ export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
    $ make install T=$DPDK_TARGET DESTDIR=install

Once DPDK is built, export the DPDK shared library location and set up OVS as
detailed in the `DPDK installation guide`_::

    $ export LD_LIBRARY_PATH=$DPDK_DIR/x86_64-native-linuxapp-gcc/lib

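As a minimal sketch of the remaining OVS setup (the exact configure options
are covered in the `DPDK installation guide`_; the invocation below simply
assumes the usual ``--with-dpdk`` switch pointing at the build directory
exported above, and ``$OVS_DIR`` as the OVS source directory)::

    $ cd $OVS_DIR
    $ ./configure --with-dpdk=$DPDK_BUILD
    $ make && make install
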
System Configuration
--------------------

To achieve optimal OVS performance, the system should be configured
appropriately. This includes BIOS tweaks, GRUB command-line additions, an
understanding of the NUMA topology, and appropriate selection of PCIe slots
for NIC placement.

Recommended BIOS Settings
~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Recommended BIOS Settings
   :header-rows: 1

   * - Setting
     - Value
   * - C3 Power State
     - Disabled
   * - C6 Power State
     - Disabled
   * - MLC Streamer
     - Enabled
   * - MLC Spatial Prefetcher
     - Enabled
   * - DCU Data Prefetcher
     - Enabled
   * - DCA
     - Enabled
   * - CPU Power and Performance
     - Performance
   * - Memory RAS and Performance Config -> NUMA optimized
     - Enabled

PCIe Slot Selection
~~~~~~~~~~~~~~~~~~~

The fastpath performance can be affected by factors related to the placement
of the NIC, such as the channel speed between the PCIe slot and the CPU, or
the proximity of the PCIe slot to the CPU cores running the DPDK application.
Listed below are the steps to identify the right PCIe slot.

#. Retrieve host details using ``dmidecode``. For example::

       $ dmidecode -t baseboard | grep "Product Name"

#. Download the technical specification for the product listed, e.g. S2600WT2.

#. Check the Product Architecture Overview for the riser slot placement, CPU
   sharing information and PCIe channel speeds.

   For example, on the S2600WT, CPU1 and CPU2 share Riser Slot 1, with a
   channel speed of 32GB/s between CPU1 and Riser Slot 1 and 16GB/s between
   CPU2 and Riser Slot 1. Running the DPDK application on CPU1 cores with the
   NIC inserted into the Riser card slots will optimize OVS performance in
   this case.

#. Check the Riser Card #1 - Root Port mapping information for the available
   slots and individual bus speeds. On the S2600WT, slots 1 and 2 have high
   bus speeds and are potential slots for NIC placement.

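Once the NIC is installed, its NUMA affinity can be cross-checked through
sysfs (the PCI address below is purely illustrative; substitute the address
reported by ``lspci`` for your NIC)::

    $ cat /sys/bus/pci/devices/0000:05:00.0/numa_node
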
Advanced Hugepage Setup
~~~~~~~~~~~~~~~~~~~~~~~

Allocate and mount 1 GB hugepages.

- For persistent allocation of huge pages, add the following options to the
  kernel bootline::

      default_hugepagesz=1GB hugepagesz=1G hugepages=N

  For platforms supporting multiple huge page sizes, add multiple options::

      default_hugepagesz=<size> hugepagesz=<size> hugepages=N

  where:

  ``N``
    number of huge pages requested
  ``size``
    huge page size with an optional suffix ``[kKmMgG]``

- For run-time allocation of huge pages::

      $ echo N > /sys/devices/system/node/nodeX/hugepages/hugepages-1048576kB/nr_hugepages

  where:

  ``N``
    number of huge pages requested
  ``X``
    NUMA node

  .. note::
     For run-time allocation of 1G huge pages, the Contiguous Memory Allocator
     (``CONFIG_CMA``) has to be supported by the kernel; check your Linux
     distribution.

Now mount the huge pages, if not already done so::

    $ mount -t hugetlbfs -o pagesize=1G none /dev/hugepages

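To confirm that the pages were actually allocated and are visible to the
system, the standard sysfs and procfs counters can be inspected (adjust the
node number and page size to match your request)::

    $ cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
    $ grep Huge /proc/meminfo
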
Enable HyperThreading
~~~~~~~~~~~~~~~~~~~~~

With HyperThreading, or SMT, enabled, a physical core appears as two logical
cores. SMT can be utilized to spawn worker threads on logical cores of the
same physical core, thereby saving additional cores.

With DPDK, when pinning pmd threads to logical cores, care must be taken to
set the correct bits of the ``pmd-cpu-mask`` to ensure that the pmd threads
are pinned to SMT siblings.

Take a sample system configuration, with 2 sockets, 2 * 10 core processors, HT
enabled. This gives us a total of 40 logical cores. To identify the physical
core shared by two logical cores, run::

    $ cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list

where ``N`` is the logical core number.

In this example, it would show that cores ``1`` and ``21`` share the same
physical core. Thus, the ``pmd-cpu-mask`` to enable two pmd threads running on
these two logical cores (one physical core) is::

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=200002

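The mask is interpreted as a hexadecimal bitmap with one bit per logical core,
so it can be derived mechanically. A quick sanity check for the cores used in
this example (1 and 21), using a throwaway Python one-liner::

    $ python -c 'print(hex((1 << 1) | (1 << 21)))'
    0x200002
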
Isolate Cores
~~~~~~~~~~~~~

The ``isolcpus`` option can be used to isolate cores from the Linux scheduler.
The isolated cores can then be dedicated to running HPC applications or
threads. This helps improve application performance due to zero context
switching and minimal cache thrashing. To run platform logic on core 0 and
isolate cores 1 to 19 from the scheduler, add ``isolcpus=1-19`` to the GRUB
command line.

.. note::
   It has been verified that in some circumstances core isolation offers only
   minimal advantage, due to the maturity of the Linux scheduler.

NUMA/Cluster-on-Die
~~~~~~~~~~~~~~~~~~~

Ideally inter-NUMA datapaths should be avoided where possible, as packets will
go across QPI and there may be a slight performance penalty when compared with
intra-NUMA datapaths. On the Intel Xeon Processor E5 v3, Cluster On Die is
introduced on models that have 10 cores or more. This makes it possible to
logically split a socket into two NUMA regions, and again it is preferred
where possible to keep critical datapaths within one cluster.

It is good practice to ensure that threads that are in the datapath are pinned
to cores in the same NUMA area, e.g. pmd threads and the QEMU vCPUs
responsible for forwarding. If DPDK is built with
``CONFIG_RTE_LIBRTE_VHOST_NUMA=y``, vHost User ports automatically detect the
NUMA socket of the QEMU vCPUs and will be serviced by a PMD from the same
node, provided a core on this node is enabled in the ``pmd-cpu-mask``.
``libnuma`` packages are required for this feature.

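To see how cores and memory are distributed across NUMA nodes (and, with
Cluster-on-Die enabled, across the resulting clusters), a quick check is::

    $ numactl --hardware
    $ lscpu | grep NUMA
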
Compiler Optimizations
~~~~~~~~~~~~~~~~~~~~~~

The default compiler optimization level is ``-O2``. Changing this to a more
aggressive compiler optimization such as ``-O3 -march=native`` with gcc
(verified on 5.3.1) can produce performance gains, though not significant
ones. ``-march=native`` produces code optimized for the local machine and
should only be used when software compilation is done on the testbed.

Performance Tuning
------------------

Affinity
~~~~~~~~

For superior performance, DPDK pmd threads and QEMU vCPU threads need to be
affinitized accordingly.

- PMD thread Affinity

  A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
  assigned to it. A pmd thread polls the ports for incoming packets, switches
  the packets and sends them to the tx port. A pmd thread is CPU bound, and
  needs to be affinitized to isolated cores for optimum performance.

  By setting a bit in the mask, a pmd thread is created and pinned to the
  corresponding CPU core. e.g. to run a pmd thread on core 2::

      $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=4

  .. note::
     A pmd thread on a NUMA node is only created if there is at least one DPDK
     interface from that NUMA node added to OVS.

- QEMU vCPU thread Affinity

  A VM performing simple packet forwarding or running complex packet pipelines
  has to ensure that the vCPU threads performing the work have as much CPU
  occupancy as possible.

  For example, on a multicore VM, multiple QEMU vCPU threads will be spawned.
  When the DPDK ``testpmd`` application that does packet forwarding is
  invoked, the ``taskset`` command should be used to affinitize the vCPU
  threads to the dedicated isolated cores on the host system, as sketched
  below.

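  As a rough sketch of what this pinning looks like in practice (the thread
  IDs below are hypothetical, and the host cores chosen must match your
  isolated set), the vCPU thread IDs can be listed with ``ps`` and pinned
  individually::

      $ ps -eLo pid,tid,comm | grep qemu
      $ taskset -pc 4 <vcpu-thread-tid-1>
      $ taskset -pc 5 <vcpu-thread-tid-2>
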
Multiple Poll-Mode Driver Threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With pmd multi-threading support, OVS creates one pmd thread for each NUMA
node by default. However, in cases where there are multiple ports/rxqs
producing traffic, performance can be improved by creating multiple pmd
threads running on separate cores. These pmd threads can share the workload by
each being responsible for different ports/rxqs. Assignment of ports/rxqs to
pmd threads is done automatically.

A set bit in the mask means a pmd thread is created and pinned to the
corresponding CPU core. For example, to run pmd threads on cores 1 and 2::

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6

When using dpdk and dpdkvhostuser ports in a bi-directional VM loopback as
shown below, spreading the workload over 2 or 4 pmd threads shows significant
improvements, as there will be more total CPU occupancy available::

    NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1

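To confirm which pmd threads have been created and how busy each one is, the
following command can be used (the output format varies between OVS
releases)::

    $ ovs-appctl dpif-netdev/pmd-stats-show
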
DPDK Physical Port Rx Queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    $ ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>

The command above sets the number of rx queues for a DPDK physical interface.
The rx queues are assigned to pmd threads on the same NUMA node in a
round-robin fashion.

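For instance, assuming a physical port named ``dpdk0`` has been added to the
bridge, the following would give it four rx queues::

    $ ovs-vsctl set Interface dpdk0 options:n_rxq=4
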
DPDK Physical Port Queue Sizes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    $ ovs-vsctl set Interface dpdk0 options:n_rxq_desc=<integer>
    $ ovs-vsctl set Interface dpdk0 options:n_txq_desc=<integer>

The commands above set the number of rx/tx descriptors that the NIC associated
with dpdk0 will be initialised with.

Different ``n_rxq_desc`` and ``n_txq_desc`` configurations yield different
benefits in terms of throughput and latency for different scenarios.
Generally, smaller queue sizes can have a positive impact on latency at the
expense of throughput. The opposite is often true for larger queue sizes.

.. note::
   Increasing the number of rx descriptors, e.g. to 4096, may have a negative
   impact on performance because non-vectorised DPDK rx functions may be used.
   This is dependent on the driver in use, but is true for the commonly used
   i40e and ixgbe DPDK drivers.

Exact Match Cache
~~~~~~~~~~~~~~~~~

Each pmd thread contains one Exact Match Cache (EMC). After initial flow setup
in the datapath, the EMC contains a single table and provides the lowest level
(fastest) switching for DPDK ports. If there is a miss in the EMC then the
next level where switching will occur is the datapath classifier. Missing in
the EMC and looking up in the datapath classifier incurs a significant
performance penalty. If lookup misses occur in the EMC because it is too small
to handle the number of flows, its size can be increased. The EMC size can be
modified by editing the define ``EM_FLOW_HASH_SHIFT`` in
``lib/dpif-netdev.c``.

As mentioned above, an EMC is per pmd thread. An alternative way of increasing
the aggregate amount of possible flow entries in the EMC and avoiding datapath
classifier lookups is to have multiple pmd threads running.

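A quick way to locate the definition before editing it (run from the top of
the OVS source tree; in the OVS sources of this vintage the EMC holds
``1 << EM_FLOW_HASH_SHIFT`` entries, so incrementing the shift by one doubles
the cache size)::

    $ grep -n 'EM_FLOW_HASH_SHIFT' lib/dpif-netdev.c
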
Rx Mergeable Buffers
~~~~~~~~~~~~~~~~~~~~

Rx mergeable buffers is a virtio feature that allows chaining of multiple
virtio descriptors to handle large packet sizes. Large packets are handled by
reserving and chaining multiple free descriptors together. Mergeable buffer
support is negotiated between the virtio driver and virtio device and is
supported by the DPDK vhost library. This behavior is supported and enabled by
default; however, in the case where the user knows that rx mergeable buffers
are not needed, i.e. jumbo frames are not needed, it can be forced off by
adding ``mrg_rxbuf=off`` to the QEMU command line options. By not reserving
multiple chains of descriptors, more individual virtio descriptors are
available for rx to the guest using dpdkvhost ports, and this can improve
performance.

OVS Testcases
-------------

PHY-VM-PHY (vHost Loopback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `DPDK installation guide`_ details steps for the PHY-VM-PHY loopback
testcase and packet forwarding using the DPDK testpmd application in the guest
VM. For users wishing to do packet forwarding using the kernel stack instead,
run the below commands on the guest::

    $ ifconfig eth1 1.1.1.2/24
    $ ifconfig eth2 1.1.2.2/24
    $ systemctl stop firewalld.service
    $ systemctl stop iptables.service
    $ sysctl -w net.ipv4.ip_forward=1
    $ sysctl -w net.ipv4.conf.all.rp_filter=0
    $ sysctl -w net.ipv4.conf.eth1.rp_filter=0
    $ sysctl -w net.ipv4.conf.eth2.rp_filter=0
    $ route add -net 1.1.2.0/24 eth2
    $ route add -net 1.1.1.0/24 eth1
    $ arp -s 1.1.2.99 DE:AD:BE:EF:CA:FE
    $ arp -s 1.1.1.99 DE:AD:BE:EF:CA:EE

PHY-VM-PHY (IVSHMEM)
~~~~~~~~~~~~~~~~~~~~

IVSHMEM can also be validated using the PHY-VM-PHY configuration. To begin,
follow the steps described in the `DPDK installation guide`_ to create and
initialize the database, start ovs-vswitchd and add ``dpdk``-type devices to
bridge ``br0``. Once complete, follow the below steps:

1. Add DPDK ring port to the bridge::

       $ ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr

2. Build modified QEMU

   QEMU must be patched to enable IVSHMEM support::

       $ cd /usr/src/
       $ wget http://wiki.qemu.org/download/qemu-2.2.1.tar.bz2
       $ tar -jxvf qemu-2.2.1.tar.bz2
       $ cd /usr/src/qemu-2.2.1
       $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/patches/ivshmem-qemu-2.2.1.patch
       $ patch -p1 < ivshmem-qemu-2.2.1.patch
       $ ./configure --target-list=x86_64-softmmu --enable-debug --extra-cflags='-g'
       $ make -j 4

3. Generate QEMU commandline::

       $ mkdir -p /usr/src/cmdline_generator
       $ cd /usr/src/cmdline_generator
       $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/cmdline_generator.c
       $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/Makefile
       $ export RTE_SDK=/usr/src/dpdk-16.07
       $ export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc
       $ make
       $ ./build/cmdline_generator -m -p dpdkr0 XXX
       $ cmdline=`cat OVSMEMPOOL`

4. Start guest VM::

       $ export VM_NAME=ivshmem-vm
       $ export QCOW2_IMAGE=/root/CentOS7_x86_64.qcow2
       $ export QEMU_BIN=/usr/src/qemu-2.2.1/x86_64-softmmu/qemu-system-x86_64
       $ taskset 0x20 $QEMU_BIN -cpu host -smp 2,cores=2 -hda $QCOW2_IMAGE \
           -m 4096 --enable-kvm -name $VM_NAME -nographic -vnc :2 \
           -pidfile /tmp/vm1.pid $cmdline

5. Build and run the sample ``dpdkr`` app in VM::

       $ echo 1024 > /proc/sys/vm/nr_hugepages
       $ mount -t hugetlbfs nodev /dev/hugepages (if not already mounted)

       # Build the DPDK ring application in the VM
       $ export RTE_SDK=/root/dpdk-16.07
       $ export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc
       $ make

       # Run dpdkring application
       $ ./build/dpdkr -c 1 -n 4 -- -n 0
       # where "-n 0" refers to ring '0' i.e. dpdkr0


PHY-VM-PHY (vHost Multiqueue)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

vHost Multiqueue functionality can also be validated using the PHY-VM-PHY
configuration. To begin, follow the steps described in the `DPDK installation
guide`_ to create and initialize the database, start ovs-vswitchd and add
``dpdk``-type devices to bridge ``br0``. Once complete, follow the below
steps:

1. Configure PMD and RXQs.

   For example, set the number of dpdk port rx queues to at least 2. The
   number of rx queues at the vhost-user interface gets automatically
   configured after virtio device connection and doesn't need manual
   configuration::

       $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=c
       $ ovs-vsctl set Interface dpdk0 options:n_rxq=2
       $ ovs-vsctl set Interface dpdk1 options:n_rxq=2

2. Instantiate Guest VM using QEMU cmdline

   Appropriate software versions must be used to ensure this feature is
   supported.

   .. list-table:: Recommended Guest VM Configuration
      :header-rows: 1

      * - Setting
        - Value
      * - QEMU version
        - 2.5.0
      * - QEMU thread affinity
        - 2 cores (taskset 0x30)
      * - Memory
        - 4 GB
      * - Cores
        - 2
      * - Distro
        - Fedora 22
      * - Multiqueue
        - Enabled

   To do this, instantiate the guest as follows::

       $ export VM_NAME=vhost-vm
       $ export GUEST_MEM=4096M
       $ export QCOW2_IMAGE=/root/Fedora22_x86_64.qcow2
       $ export VHOST_SOCK_DIR=/usr/local/var/run/openvswitch
       $ taskset 0x30 qemu-system-x86_64 -cpu host -smp 2,cores=2 -m 4096M \
           -drive file=$QCOW2_IMAGE --enable-kvm -name $VM_NAME \
           -nographic -numa node,memdev=mem -mem-prealloc \
           -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \
           -chardev socket,id=char1,path=$VHOST_SOCK_DIR/dpdkvhostuser0 \
           -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=2 \
           -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mq=on,vectors=6 \
           -chardev socket,id=char2,path=$VHOST_SOCK_DIR/dpdkvhostuser1 \
           -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=2 \
           -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=6

   .. note::
      The queue value above should match the number of queues configured in
      OVS. The vector value should be set to "number of queues x 2 + 2".

3. Configure the guest interface

   Assuming there are 2 interfaces in the guest named eth0 and eth1, check the
   channel configuration and set the number of combined channels to 2 for the
   virtio devices::

       $ ethtool -l eth0
       $ ethtool -L eth0 combined 2
       $ ethtool -L eth1 combined 2

   More information can be found in the vHost Walkthrough section.

4. Configure kernel packet forwarding

   Configure IP addresses and enable the interfaces::

       $ ifconfig eth0 5.5.5.1/24 up
       $ ifconfig eth1 90.90.90.1/24 up

   Configure IP forwarding and add route entries::

       $ sysctl -w net.ipv4.ip_forward=1
       $ sysctl -w net.ipv4.conf.all.rp_filter=0
       $ sysctl -w net.ipv4.conf.eth0.rp_filter=0
       $ sysctl -w net.ipv4.conf.eth1.rp_filter=0
       $ ip route add 2.1.1.0/24 dev eth1
       $ route add default gw 2.1.1.2 eth1
       $ route add default gw 90.90.90.90 eth1
       $ arp -s 90.90.90.90 DE:AD:BE:EF:CA:FE
       $ arp -s 2.1.1.2 DE:AD:BE:EF:CA:FA

   Check traffic on multiple queues::

       $ cat /proc/interrupts | grep virtio

vHost Walkthrough
-----------------

Two types of vHost User ports are available in OVS:

- vhost-user (``dpdkvhostuser``)

- vhost-user-client (``dpdkvhostuserclient``)

vHost User uses a client-server model. The server creates/manages/destroys the
vHost User sockets, and the client connects to the server. Depending on which
port type you use, ``dpdkvhostuser`` or ``dpdkvhostuserclient``, a different
configuration of the client-server model is used.

For vhost-user ports, Open vSwitch acts as the server and QEMU the client. For
vhost-user-client ports, Open vSwitch acts as the client and QEMU the server.

vhost-user
~~~~~~~~~~

1. Install the prerequisites:

   - QEMU version >= 2.2

2. Add vhost-user ports to the switch.

   Unlike DPDK ring ports, DPDK vhost-user ports can have arbitrary names,
   except that forward and backward slashes are prohibited in the names.

   For vhost-user, the name of the port type is ``dpdkvhostuser``::

       $ ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1 \
           type=dpdkvhostuser

   This action creates a socket located at
   ``/usr/local/var/run/openvswitch/vhost-user-1``, which you must provide to
   your VM on the QEMU command line. More instructions on this can be found in
   the next section "Adding vhost-user ports to VM".

   .. note::
      If you wish for the vhost-user sockets to be created in a sub-directory
      of ``/usr/local/var/run/openvswitch``, you may specify this directory in
      the ovsdb like so::

          $ ovs-vsctl --no-wait \
              set Open_vSwitch . other_config:vhost-sock-dir=subdir

3. Add vhost-user ports to VM

   1. Configure sockets

      Pass the following parameters to QEMU to attach a vhost-user device::

          -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1
          -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
          -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1

      where ``vhost-user-1`` is the name of the vhost-user port added to the
      switch.

      Repeat the above parameters for multiple devices, changing the chardev
      ``path`` and ``id`` as necessary. Note that a separate and different
      chardev ``path`` needs to be specified for each vhost-user device. For
      example, if you have a second vhost-user port named ``vhost-user-2``,
      append your QEMU command line with an additional set of parameters::

          -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
          -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce
          -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2

   2. Configure hugepages

      QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports access
      a virtio-net device's virtual rings and packet buffers by mapping the
      VM's physical memory on hugetlbfs. To enable vhost-user ports to map the
      VM's memory into their process address space, pass the following
      parameters to QEMU::

          -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on
          -numa node,memdev=mem -mem-prealloc

   3. Enable multiqueue support (optional)

      QEMU needs to be configured to use multiqueue::

          -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
          -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q
          -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v

      where:

      ``$q``
        The number of queues
      ``$v``
        The number of vectors, which is ``$q`` * 2 + 2

      The vhost-user interface will be automatically reconfigured with the
      required number of rx and tx queues after connection of the virtio
      device. Manual configuration of ``n_rxq`` is not supported because OVS
      will work properly only if ``n_rxq`` matches the number of queues
      configured in QEMU.

      At least 2 PMDs should be configured for the vswitch when using
      multiqueue. Using a single PMD will cause traffic to be enqueued to the
      same vhost queue rather than being distributed among different vhost
      queues for a vhost-user interface.

      If traffic destined for a VM configured with multiqueue arrives at the
      vswitch via a physical DPDK port, then the number of rxqs should also be
      set to at least 2 for that physical DPDK port. This is required to
      increase the probability that a different PMD will handle the multiqueue
      transmission to the guest using a different vhost queue.

      If one wishes to use multiple queues for an interface in the guest, the
      driver in the guest operating system must be configured to do so. It is
      recommended that the number of queues configured be equal to ``$q``.

      For example, this can be done for the Linux kernel virtio-net driver
      with::

          $ ethtool -L <DEV> combined <$q>

      where:

      ``-L``
        Changes the numbers of channels of the specified network device
      ``combined``
        Changes the number of multi-purpose channels.

Configure the VM using libvirt
++++++++++++++++++++++++++++++

You can also build and configure the VM using libvirt rather than QEMU by
itself.

1. Change the user/group, access control policy and restart libvirtd.

   - In ``/etc/libvirt/qemu.conf`` add/edit the following lines::

         user = "root"
         group = "root"

   - Disable SELinux or set it to permissive mode::

         $ setenforce 0

   - Restart the libvirtd process. For example, on Fedora::

         $ systemctl restart libvirtd.service

2. Instantiate the VM

   - Copy the XML configuration described in the `DPDK installation guide`_.

   - Start the VM::

         $ virsh create demovm.xml

   - Connect to the guest console::

         $ virsh console demovm

3. Configure the VM

   The demovm XML configuration is aimed at achieving out-of-the-box
   performance on the VM.

   - The vCPUs are pinned to the cores of CPU socket 0 using ``vcpupin``.

   - The NUMA cell and shared memory access are configured using
     ``memAccess='shared'``.

   - Mergeable buffers are disabled using ``mrg_rxbuf='off'``.

   Refer to the `libvirt documentation
   <http://libvirt.org/formatdomain.html>`__ for more information.


vhost-user-client
~~~~~~~~~~~~~~~~~

1. Install the prerequisites:

   - QEMU version >= 2.7

2. Add vhost-user-client ports to the switch.

   Unlike vhost-user ports, the name given to the port does not govern the
   name of the socket device. ``vhost-server-path`` reflects the full path of
   the socket that has been or will be created by QEMU for the given vHost
   User client port.

   For vhost-user-client, the name of the port type is
   ``dpdkvhostuserclient``::

       $ VHOST_USER_SOCKET_PATH=/path/to/socket
       $ ovs-vsctl add-port br0 vhost-client-1 \
           -- set Interface vhost-client-1 type=dpdkvhostuserclient \
              options:vhost-server-path=$VHOST_USER_SOCKET_PATH

3. Add vhost-user-client ports to VM

   1. Configure sockets

      Pass the following parameters to QEMU to attach a vhost-user device::

          -chardev socket,id=char1,path=$VHOST_USER_SOCKET_PATH,server
          -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
          -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1

      where ``$VHOST_USER_SOCKET_PATH`` is the path configured as
      ``vhost-server-path`` on the corresponding port in the switch.

      If the corresponding dpdkvhostuserclient port has not yet been
      configured in OVS with ``vhost-server-path=/path/to/socket``, QEMU will
      print a log similar to the following::

          QEMU waiting for connection on: disconnected:unix:/path/to/socket,server

      QEMU will wait until the port is created successfully in OVS to boot the
      VM.

      One benefit of using this mode is the ability for vHost ports to
      'reconnect' in the event of the switch crashing or being brought down.
      Once it is brought back up, the vHost ports will reconnect automatically
      and normal service will resume.


DPDK Backend Inside VM
~~~~~~~~~~~~~~~~~~~~~~

Additional configuration is required if you want to run ovs-vswitchd with a
DPDK backend inside a QEMU virtual machine. ovs-vswitchd creates separate DPDK
tx queues for each CPU core available. This operation fails inside a QEMU
virtual machine because, by default, the virtio NIC provided to the guest is
configured to support only a single tx queue and a single rx queue. To change
this behavior, you need to turn on the ``mq`` (multiqueue) property of all
``virtio-net-pci`` devices emulated by QEMU and used by DPDK. You may do this
manually (by changing the QEMU command line) or, if you use libvirt, by adding
the following string to the ``<interface>`` sections of all network devices
used by DPDK::

    <driver name='vhost' queues='N'/>

where:

``N``
  determines how many queues can be used by the guest.

This requires QEMU >= 2.2.


QoS
---

Assuming you have a vhost-user port transmitting traffic consisting of packets
of size 64 bytes, the following command would limit the egress transmission
rate of the port to ~1,000,000 packets per second::

    $ ovs-vsctl set port vhost-user0 qos=@newqos -- \
        --id=@newqos create qos type=egress-policer other-config:cir=46000000 \
        other-config:cbs=2048

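The committed information rate of 46,000,000 follows from simple arithmetic on
the packet size (assuming the policer counts the frame excluding the 14-byte
Ethernet header and 4-byte CRC): a 64-byte packet leaves 46 bytes, so
1,000,000 packets/s x 46 bytes = 46,000,000 bytes/s.
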
To examine the QoS configuration of the port, run::

    $ ovs-appctl -t ovs-vswitchd qos/show vhost-user0

To clear the QoS configuration from the port and ovsdb, run::

    $ ovs-vsctl destroy QoS vhost-user0 -- clear Port vhost-user0 qos

Refer to ``vswitch.xml`` for more details on egress-policer.

Rate Limiting
-------------

Here is an example of ingress policing usage. Assuming you have a vhost-user
port receiving traffic consisting of packets of size 64 bytes, the following
command would limit the reception rate of the port to ~1,000,000 packets per
second::

    $ ovs-vsctl set interface vhost-user0 ingress_policing_rate=368000 \
        ingress_policing_burst=1000

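``ingress_policing_rate`` is expressed in kilobits per second, so under the
same assumption about which bytes are counted, the figure is consistent with
the QoS example above: 46 bytes x 8 = 368 bits per packet, and
1,000,000 packets/s x 368 bits = 368,000 kbit/s.
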
To examine the ingress policer configuration of the port::

    $ ovs-vsctl list interface vhost-user0

To clear the ingress policer configuration from the port::

    $ ovs-vsctl set interface vhost-user0 ingress_policing_rate=0

Refer to ``vswitch.xml`` for more details on ingress-policer.


Flow Control
------------

Flow control can be enabled only on DPDK physical ports. To enable flow
control support on the tx side while adding a port, run::

    $ ovs-vsctl add-port br0 dpdk0 -- \
        set Interface dpdk0 type=dpdk options:tx-flow-ctrl=true

Similarly, to enable rx flow control, run::

    $ ovs-vsctl add-port br0 dpdk0 -- \
        set Interface dpdk0 type=dpdk options:rx-flow-ctrl=true

To enable flow control auto-negotiation, run::

    $ ovs-vsctl add-port br0 dpdk0 -- \
        set Interface dpdk0 type=dpdk options:flow-ctrl-autoneg=true

To turn on tx flow control at run time (after the port has been added to
OVS), run::

    $ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=true

The flow control parameters can be turned off by setting the respective
parameter to ``false``. To disable flow control on the tx side, run::

    $ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false


pdump
-----

Pdump allows you to listen on DPDK ports and view the traffic that is passing
on them. To use this utility, one must have libpcap installed on the system.
Furthermore, DPDK must be built with ``CONFIG_RTE_LIBRTE_PDUMP=y`` and
``CONFIG_RTE_LIBRTE_PMD_PCAP=y``.

.. warning::
   A performance decrease is expected when using a monitoring application like
   the DPDK pdump app.

To use pdump, simply launch OVS as usual. Then, navigate to the ``app/pdump``
directory in DPDK, ``make`` the application and run it like so::

    $ sudo ./build/app/dpdk-pdump -- \
        --pdump port=0,queue=0,rx-dev=/tmp/pkts.pcap \
        --server-socket-path=/usr/local/var/run/openvswitch

The above command captures traffic received on queue 0 of port 0 and stores it
in ``/tmp/pkts.pcap``. Other combinations of port numbers, queue numbers and
pcap locations are of course also available to use. For example, to capture
all packets that traverse port 0 in a single pcap file::

    $ sudo ./build/app/dpdk-pdump -- \
        --pdump 'port=0,queue=*,rx-dev=/tmp/pkts.pcap,tx-dev=/tmp/pkts.pcap' \
        --server-socket-path=/usr/local/var/run/openvswitch

``server-socket-path`` must be set to the value of ovs_rundir(), which
typically resolves to ``/usr/local/var/run/openvswitch``.

Many tools are available to view the contents of the pcap file. One example is
tcpdump. Issue the following command to view the contents of ``pkts.pcap``::

    $ tcpdump -r pkts.pcap

More information on the pdump app and its usage can be found in the `DPDK docs
<http://dpdk.org/doc/guides/sample_app_ug/pdump.html>`__.


Jumbo Frames
------------

By default, DPDK ports are configured with standard Ethernet MTU (1500B). To
enable Jumbo Frames support for a DPDK port, change the Interface's
``mtu_request`` attribute to a sufficiently large value. For example, to add a
DPDK Phy port with an MTU of 9000::

    $ ovs-vsctl add-port br0 dpdk0 \
        -- set Interface dpdk0 type=dpdk \
        -- set Interface dpdk0 mtu_request=9000

Similarly, to change the MTU of an existing port to 6200::

    $ ovs-vsctl set Interface dpdk0 mtu_request=6200

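To check the MTU that ended up applied to the port, the read-only ``mtu``
column of the Interface record can be queried (this may differ from the
requested value if the device could not honor it)::

    $ ovs-vsctl get Interface dpdk0 mtu
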
Some additional configuration is needed to take advantage of jumbo frames with
vHost ports:

1. *mergeable buffers* must be enabled for vHost ports, as demonstrated in the
   QEMU command line snippet below::

       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on

2. Where virtio devices are bound to the Linux kernel driver in a guest
   environment (i.e. interfaces are not bound to an in-guest DPDK driver), the
   MTU of those logical network interfaces must also be increased to a
   sufficiently large value. This avoids segmentation of Jumbo Frames received
   in the guest. Note that 'MTU' refers to the length of the IP packet only,
   and not that of the entire frame.

   To calculate the exact MTU of a standard IPv4 frame, subtract the L2 header
   and CRC lengths (i.e. 18B) from the max supported frame size. So, to set
   the MTU for a 9018B Jumbo Frame::

       $ ifconfig eth1 mtu 9000

When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments is
increased, such that a full Jumbo Frame of a specific size may be accommodated
within a single mbuf segment.

Jumbo frame support has been validated against 9728B frames, which is the
largest frame size supported by the Fortville NIC using the DPDK i40e driver,
but larger frames and other DPDK NIC drivers may be supported. These cases are
common for use cases involving East-West traffic only.


vsperf
------

The vsperf project aims to develop a vSwitch test framework that can be used
to validate the suitability of different vSwitch implementations in a telco
deployment environment. More information can be found on the `OPNFV wiki
<https://wiki.opnfv.org/display/vsperf/VSperf+Home>`__.

Bug Reporting
-------------

Report problems to bugs@openvswitch.org.

.. _DPDK installation guide: INSTALL.DPDK.rst