1 ..
2 Licensed under the Apache License, Version 2.0 (the "License"); you may
3 not use this file except in compliance with the License. You may obtain
4 a copy of the License at
5
6 http://www.apache.org/licenses/LICENSE-2.0
7
8 Unless required by applicable law or agreed to in writing, software
9 distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
10 WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
11 License for the specific language governing permissions and limitations
12 under the License.
13
14 Convention for heading levels in Open vSwitch documentation:
15
16 ======= Heading 0 (reserved for the title in a document)
17 ------- Heading 1
18 ~~~~~~~ Heading 2
19 +++++++ Heading 3
20 ''''''' Heading 4
21
22 Avoid deeper levels because they do not render well.
23
24 =================================
25 Open vSwitch with DPDK (Advanced)
26 =================================
27
28 The Advanced Install Guide explains how to improve OVS performance when using
29 the DPDK datapath. This guide provides information on tuning, system
30 configuration, troubleshooting, static code analysis and testcases.
31
32 Building as a Shared Library
33 ----------------------------
34
35 DPDK can be built as a static or a shared library and is linked into
36 applications that use the DPDK datapath. When building OVS with DPDK, you can
37 link Open vSwitch against the shared DPDK library.
38
39 .. note::
40 A minor performance loss is seen with OVS when using the shared DPDK library
41 as compared to the static library.
42
43 To build Open vSwitch using DPDK as a shared library, first refer to the `DPDK
44 installation guide`_ for download instructions for DPDK and OVS.
45
46 Once DPDK and OVS have been downloaded, you must configure the DPDK library
47 accordingly. Simply set ``CONFIG_RTE_BUILD_SHARED_LIB=y`` in
48 ``config/common_base``. Once done, DPDK can be built and installed as usual.
49 For example::
50
51 $ export DPDK_TARGET=x86_64-native-linuxapp-gcc
52 $ export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
53 $ make install T=$DPDK_TARGET DESTDIR=install
54
55 Once DPDK is built, export the DPDK shared library location and setup OVS as
56 detailed in the `DPDK installation guide`_::
57
58 $ export LD_LIBRARY_PATH=$DPDK_DIR/x86_64-native-linuxapp-gcc/lib
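
With the library path exported, OVS itself can be configured against this DPDK
build in the usual way. A minimal sketch, assuming ``$OVS_DIR`` points at your
OVS source tree and the standard autotools flow from the `DPDK installation
guide`_ is used::

    $ cd $OVS_DIR
    $ ./configure --with-dpdk=$DPDK_BUILD
    $ make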
59
60 System Configuration
61 --------------------
62
63 To achieve optimal OVS performance, the system should be configured
64 appropriately. This includes BIOS tweaks, GRUB cmdline additions, an
65 understanding of NUMA nodes and careful selection of PCIe slots for NIC placement.
66
67 Recommended BIOS Settings
68 ~~~~~~~~~~~~~~~~~~~~~~~~~
69
70 .. list-table:: Recommended BIOS Settings
71 :header-rows: 1
72
73 * - Setting
74 - Value
75 * - C3 Power State
76 - Disabled
77 * - C6 Power State
78 - Disabled
79 * - MLC Streamer
80 - Enabled
81 * - MLC Spatial Prefetcher
82 - Enabled
83 * - DCU Data Prefetcher
84 - Enabled
85 * - DCA
86 - Enabled
87 * - CPU Power and Performance
88 - Performance
89 * - Memory RAS and Performance Config -> NUMA optimized
90 - Enabled
91
92 PCIe Slot Selection
93 ~~~~~~~~~~~~~~~~~~~
94
95 The fastpath performance can be affected by factors related to the placement of
96 the NIC, such as channel speeds between the PCIe slot and CPU, or the proximity
97 of the PCIe slot to the CPU cores running the DPDK application. Listed below
98 are the steps to identify the right PCIe slot.
99
100 #. Retrieve host details using ``dmidecode``. For example::
101
102 $ dmidecode -t baseboard | grep "Product Name"
103
104 #. Download the technical specification for the product listed, e.g. S2600WT2
105
106 #. Check the Product Architecture Overview for the Riser slot placement, CPU
107 sharing info and PCIe channel speeds.
108
109 For example: on the S2600WT, CPU1 and CPU2 share Riser Slot 1, with the
110 channel speed between CPU1 and Riser Slot 1 at 32GB/s and between CPU2 and
111 Riser Slot 1 at 16GB/s. Running the DPDK app on CPU1 cores with the NIC
112 inserted into the Riser card slots will optimize OVS performance in this case.
113
114 #. Check the Riser Card #1 - Root Port mapping information for the available
115 slots and individual bus speeds. On the S2600WT, slots 1 and 2 have high bus
116 speeds and are potential slots for NIC placement.
117
118 Advanced Hugepage Setup
119 ~~~~~~~~~~~~~~~~~~~~~~~
120
121 Allocate and mount 1 GB hugepages.
122
123 - For persistent allocation of huge pages, add the following options to the
124 kernel bootline::
125
126 default_hugepagesz=1GB hugepagesz=1G hugepages=N
127
128 For platforms supporting multiple huge page sizes, add multiple options::
129
130 default_hugepagesz=<size> hugepagesz=<size> hugepages=N
131
132 where:
133
134 ``N``
135 number of huge pages requested
136 ``size``
137 huge page size with an optional suffix ``[kKmMgG]``
138
139 - For run-time allocation of huge pages::
140
141 $ echo N > /sys/devices/system/node/nodeX/hugepages/hugepages-1048576kB/nr_hugepages
142
143 where:
144
145 ``N``
146 number of huge pages requested
147 ``X``
148 NUMA Node
149
150 .. note::
151 For run-time allocation of 1G huge pages, the Contiguous Memory Allocator
152 (``CONFIG_CMA``) has to be supported by the kernel; check your Linux distro.
153
154 Now mount the huge pages, if not already mounted::
155
156 $ mount -t hugetlbfs -o pagesize=1G none /dev/hugepages
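
To verify that the pages were actually allocated, and on which NUMA node, check
the kernel's meminfo and sysfs interfaces; for example::

    $ grep HugePages_ /proc/meminfo
    $ cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages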
157
158 Enable HyperThreading
159 ~~~~~~~~~~~~~~~~~~~~~
160
161 With HyperThreading, or SMT, enabled, a physical core appears as two logical
162 cores. SMT can be utilized to spawn worker threads on logical cores of the same
163 physical core, thereby saving additional cores.
164
165 With DPDK, when pinning pmd threads to logical cores, care must be taken to set
166 the correct bits of the ``pmd-cpu-mask`` to ensure that the pmd threads are
167 pinned to SMT siblings.
168
169 Take a sample system configuration, with 2 sockets, 2 * 10 core processors, HT
170 enabled. This gives us a total of 40 logical cores. To identify the physical
171 core shared by two logical cores, run::
172
173 $ cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list
174
175 where ``N`` is the logical core number.
176
177 In this example, it would show that cores ``1`` and ``21`` share the same
178 physical core. Thus, the ``pmd-cpu-mask`` used to enable two pmd threads
179 running on these two logical cores (one physical core) is::
180
181 $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=200002
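
As a sanity check, the hexadecimal mask for any set of logical cores can be
computed by OR-ing the corresponding per-core bits; a small shell sketch for
cores 1 and 21::

    $ printf "%x\n" $(( (1 << 1) | (1 << 21) ))
    200002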
182
183 Isolate Cores
184 ~~~~~~~~~~~~~
185
186 The ``isolcpus`` option can be used to isolate cores from the Linux scheduler.
187 The isolated cores can then be dedicated to running HPC applications or
188 threads. This improves application performance due to zero context switching
189 and minimal cache thrashing. To run platform logic on core 0 and isolate
190 cores 1 through 19 from the scheduler, add ``isolcpus=1-19`` to the GRUB
191 cmdline.
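
For example, on a Fedora-style system (paths and tools differ between distros),
this is typically done by appending the option to ``GRUB_CMDLINE_LINUX`` in
``/etc/default/grub`` and regenerating the GRUB configuration::

    # /etc/default/grub
    GRUB_CMDLINE_LINUX="... isolcpus=1-19"

    # regenerate the grub config and reboot
    $ grub2-mkconfig -o /boot/grub2/grub.cfg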
192
193 .. note::
194 In some circumstances, core isolation has been found to provide only minimal
195 advantage, due to the maturity of the Linux scheduler.
196
197 NUMA/Cluster-on-Die
198 ~~~~~~~~~~~~~~~~~~~
199
200 Ideally, inter-NUMA datapaths should be avoided where possible, as packets will
201 go across QPI and there may be a slight performance penalty when compared with
202 intra-NUMA datapaths. On the Intel Xeon Processor E5 v3, Cluster-on-Die is
203 introduced on models that have 10 cores or more. This makes it possible to
204 logically split a socket into two NUMA regions; again, it is preferable where
205 possible to keep critical datapaths within one cluster.
206
207 It is good practice to ensure that threads that are in the datapath are pinned
208 to cores in the same NUMA area, e.g. pmd threads and QEMU vCPUs responsible for
209 forwarding. If DPDK is built with ``CONFIG_RTE_LIBRTE_VHOST_NUMA=y``, vHost
210 User ports automatically detect the NUMA socket of the QEMU vCPUs and will be
211 serviced by a PMD from the same node provided a core on this node is enabled in
212 the ``pmd-cpu-mask``. ``libnuma`` packages are required for this feature.
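
To determine which NUMA node a NIC and a set of cores belong to, standard tools
can be consulted; for example (the PCI address below is illustrative)::

    $ lscpu | grep NUMA
    $ cat /sys/bus/pci/devices/0000:06:00.0/numa_node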
213
214 Compiler Optimizations
215 ~~~~~~~~~~~~~~~~~~~~~~
216
217 The default compiler optimization level is ``-O2``. Changing this to a more
218 aggressive compiler optimization such as ``-O3 -march=native`` with
219 gcc (verified on 5.3.1) can produce performance gains, though not significant.
220 ``-march=native`` will produce code optimized for the local machine and should
221 only be used when the software is compiled on the testbed itself.
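
For example, with the usual autotools flow these flags can be passed when
configuring OVS (adjust to your own build setup)::

    $ ./configure CFLAGS="-O3 -march=native"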
222
223 Performance Tuning
224 ------------------
225
226 Affinity
227 ~~~~~~~~
228
229 For superior performance, DPDK pmd threads and QEMU vCPU threads need to be
230 affinitized accordingly.
231
232 - PMD thread Affinity
233
234 A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
235 assigned to it. A pmd thread polls the ports for incoming packets, switches
236 the packets and sends them to a tx port. A pmd thread is CPU bound, and needs
237 to be affinitized to isolated cores for optimum performance.
238
239 By setting a bit in the mask, a pmd thread is created and pinned to the
240 corresponding CPU core. For example, to run a pmd thread on core 2::
241
242 $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=4
243
244 .. note::
245 A pmd thread on a NUMA node is only created if there is at least one DPDK
246 interface from that NUMA node added to OVS.
247
248 - QEMU vCPU thread Affinity
249
250 A VM performing simple packet forwarding or running complex packet pipelines
251 has to ensure that the vCPU threads performing the work have as much CPU
252 occupancy as possible.
253
254 For example, on a multicore VM, multiple QEMU vCPU threads will be spawned.
255 When the DPDK ``testpmd`` application that does packet forwarding is invoked,
256 the ``taskset`` command should be used to affinitize the vCPU threads to the
257 dedicated isolated cores on the host system.
258
259 Multiple Poll-Mode Driver Threads
260 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
261
262 With pmd multi-threading support, OVS creates one pmd thread for each NUMA node
263 by default. However, in cases where there are multiple ports/rxq's producing
264 traffic, performance can be improved by creating multiple pmd threads running
265 on separate cores. These pmd threads can share the workload by each being
266 responsible for different ports/rxq's. Assignment of ports/rxq's to pmd threads
267 is done automatically.
268
269 A set bit in the mask means a pmd thread is created and pinned to the
270 corresponding CPU core. For example, to run pmd threads on core 1 and 2::
271
272 $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6
273
274 When using dpdk and dpdkvhostuser ports in a bi-directional VM loopback as
275 shown below, spreading the workload over 2 or 4 pmd threads shows significant
276 improvements as there will be more total CPU occupancy available::
277
278 NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1
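
To inspect how rx queues have been distributed among the pmd threads, and how
busy each pmd thread is, the following commands can be used (availability
depends on the OVS version)::

    $ ovs-appctl dpif-netdev/pmd-rxq-show
    $ ovs-appctl dpif-netdev/pmd-stats-show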
279
280 DPDK Physical Port Rx Queues
281 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
282
283 ::
284
285 $ ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>
286
287 The command above sets the number of rx queues for the DPDK physical interface.
288 The rx queues are assigned to pmd threads on the same NUMA node in a
289 round-robin fashion.
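
For example, to spread the traffic of the physical port ``dpdk0`` over four rx
queues (and hence up to four pmd threads, if enough are enabled)::

    $ ovs-vsctl set Interface dpdk0 options:n_rxq=4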
290
291 DPDK Physical Port Queue Sizes
292 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
293
294 ::
295
296 $ ovs-vsctl set Interface dpdk0 options:n_rxq_desc=<integer>
297 $ ovs-vsctl set Interface dpdk0 options:n_txq_desc=<integer>
298
299 The command above sets the number of rx/tx descriptors that the NIC associated
300 with dpdk0 will be initialised with.
301
302 Different ``n_rxq_desc`` and ``n_txq_desc`` configurations yield different
303 benefits in terms of throughput and latency for different scenarios.
304 Generally, smaller queue sizes can have a positive impact for latency at the
305 expense of throughput. The opposite is often true for larger queue sizes.
306 Note: increasing the number of rx descriptors, e.g. to 4096, may have a
307 negative impact on performance due to the fact that non-vectorised DPDK rx
308 functions may be used. This is dependent on the driver in use, but is true for
309 the commonly used i40e and ixgbe DPDK drivers.
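
For example, to configure ``dpdk0`` with 2048-entry rx and tx rings (the values
are illustrative; the best sizes are workload dependent and should be
benchmarked)::

    $ ovs-vsctl set Interface dpdk0 options:n_rxq_desc=2048
    $ ovs-vsctl set Interface dpdk0 options:n_txq_desc=2048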
310
311 Exact Match Cache
312 ~~~~~~~~~~~~~~~~~
313
314 Each pmd thread contains one Exact Match Cache (EMC). After initial flow setup
315 in the datapath, the EMC contains a single table and provides the lowest level
316 (fastest) switching for DPDK ports. If there is a miss in the EMC then the next
317 level where switching will occur is the datapath classifier. Missing in the
318 EMC and looking up in the datapath classifier incurs a significant performance
319 penalty. If lookup misses occur in the EMC because it is too small to handle
320 the number of flows, its size can be increased. The EMC size can be modified by
321 editing the define ``EM_FLOW_HASH_SHIFT`` in ``lib/dpif-netdev.c``.
322
323 As mentioned above, an EMC is per pmd thread. An alternative way of increasing
324 the aggregate amount of possible flow entries in EMC and avoiding datapath
325 classifier lookups is to have multiple pmd threads running.
326
327 Rx Mergeable Buffers
328 ~~~~~~~~~~~~~~~~~~~~
329
330 Rx mergeable buffers is a virtio feature that allows chaining of multiple
331 virtio descriptors to handle large packet sizes. Large packets are handled by
332 reserving and chaining multiple free descriptors together. Mergeable buffer
333 support is negotiated between the virtio driver and virtio device and is
334 supported by the DPDK vhost library. This behavior is supported and enabled by
335 default; however, in the case where the user knows that rx mergeable buffers
336 are not needed, i.e. jumbo frames are not needed, it can be forced off by
337 adding ``mrg_rxbuf=off`` to the QEMU command line options. By not reserving
338 multiple chains of descriptors, more individual virtio descriptors are made
339 available for rx to the guest using dpdkvhost ports, which can improve performance.
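
For example, the QEMU ``-device`` argument for a vhost-user backed NIC with
mergeable buffers forced off might look like the following (other parameters as
in the vHost sections later in this guide)::

    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=off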
340
341 OVS Testcases
342 -------------
343
344 PHY-VM-PHY (vHost Loopback)
345 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
346
347 The `DPDK installation guide`_ details steps for the PHY-VM-PHY loopback
348 testcase and packet forwarding using the DPDK testpmd application in the guest
349 VM. For users wishing to do packet forwarding using the kernel stack instead,
350 run the below commands on the guest::
351
352 $ ifconfig eth1 1.1.1.2/24
353 $ ifconfig eth2 1.1.2.2/24
354 $ systemctl stop firewalld.service
355 $ systemctl stop iptables.service
356 $ sysctl -w net.ipv4.ip_forward=1
357 $ sysctl -w net.ipv4.conf.all.rp_filter=0
358 $ sysctl -w net.ipv4.conf.eth1.rp_filter=0
359 $ sysctl -w net.ipv4.conf.eth2.rp_filter=0
360 $ route add -net 1.1.2.0/24 eth2
361 $ route add -net 1.1.1.0/24 eth1
362 $ arp -s 1.1.2.99 DE:AD:BE:EF:CA:FE
363 $ arp -s 1.1.1.99 DE:AD:BE:EF:CA:EE
364
365 PHY-VM-PHY (IVSHMEM)
366 ~~~~~~~~~~~~~~~~~~~~
367
368 IVSHMEM can also be validated using the PHY-VM-PHY configuration. To begin,
369 follow the steps described in the `DPDK installation guide`_ to create and
370 initialize the database, start ovs-vswitchd and add ``dpdk``-type devices to
371 bridge ``br0``. Once complete, follow the below steps:
372
373 1. Add DPDK ring port to the bridge::
374
375 $ ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr
376
377 2. Build modified QEMU
378
379 QEMU must be patched to enable IVSHMEM support::
380
381 $ cd /usr/src/
382 $ wget http://wiki.qemu.org/download/qemu-2.2.1.tar.bz2
383 $ tar -jxvf qemu-2.2.1.tar.bz2
384 $ cd /usr/src/qemu-2.2.1
385 $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/patches/ivshmem-qemu-2.2.1.patch
386 $ patch -p1 < ivshmem-qemu-2.2.1.patch
387 $ ./configure --target-list=x86_64-softmmu --enable-debug --extra-cflags='-g'
388 $ make -j 4
389
390 3. Generate QEMU commandline::
391
392 $ mkdir -p /usr/src/cmdline_generator
393 $ cd /usr/src/cmdline_generator
394 $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/cmdline_generator.c
395 $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/Makefile
396 $ export RTE_SDK=/usr/src/dpdk-16.07
397 $ export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc
398 $ make
399 $ ./build/cmdline_generator -m -p dpdkr0 XXX
400 $ cmdline=`cat OVSMEMPOOL`
401
402 4. Start guest VM::
403
404 $ export VM_NAME=ivshmem-vm
405 $ export QCOW2_IMAGE=/root/CentOS7_x86_64.qcow2
406 $ export QEMU_BIN=/usr/src/qemu-2.2.1/x86_64-softmmu/qemu-system-x86_64
407 $ taskset 0x20 $QEMU_BIN -cpu host -smp 2,cores=2 -hda $QCOW2_IMAGE \
408 -m 4096 --enable-kvm -name $VM_NAME -nographic -vnc :2 \
409 -pidfile /tmp/vm1.pid $cmdline
410
411 5. Build and run the sample ``dpdkr`` app in VM::
412
413 $ echo 1024 > /proc/sys/vm/nr_hugepages
414 $ mount -t hugetlbfs nodev /dev/hugepages  # if not already mounted
415
416 # Build the DPDK ring application in the VM
417 $ export RTE_SDK=/root/dpdk-16.07
418 $ export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc
419 $ make
420
421 # Run dpdkring application
422 $ ./build/dpdkr -c 1 -n 4 -- -n 0
423 # where "-n 0" refers to ring '0' i.e dpdkr0
424
425 PHY-VM-PHY (vHost Multiqueue)
426 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
427
428 vHost Multiqueue functionality can also be validated using the PHY-VM-PHY
429 configuration. To begin, follow the steps described in the `DPDK installation
430 guide`_ to create and initialize the database, start ovs-vswitchd and add
431 ``dpdk``-type devices to bridge ``br0``. Once complete, follow the below steps:
432
433 1. Configure PMD and RXQs.
434
435 For example, set the number of dpdk port rx queues to at least 2. The number
436 of rx queues at the vhost-user interface gets automatically configured after
437 virtio device connection and doesn't need manual configuration::
438
439 $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=c
440 $ ovs-vsctl set Interface dpdk0 options:n_rxq=2
441 $ ovs-vsctl set Interface dpdk1 options:n_rxq=2
442
443 2. Instantiate Guest VM using QEMU cmdline
444
445 We must use appropriate software versions and VM settings to ensure this
446 feature is supported.
447
448 .. list-table:: Recommended Configuration
449 :header-rows: 1
450
451 * - Setting
452 - Value
453 * - QEMU version
454 - 2.5.0
455 * - QEMU thread affinity
456 - 2 cores (taskset 0x30)
457 * - Memory
458 - 4 GB
459 * - Cores
460 - 2
461 * - Distro
462 - Fedora 22
463 * - Multiqueue
464 - Enabled
465
466 To do this, instantiate the guest as follows::
467
468 $ export VM_NAME=vhost-vm
469 $ export GUEST_MEM=4096M
470 $ export QCOW2_IMAGE=/root/Fedora22_x86_64.qcow2
471 $ export VHOST_SOCK_DIR=/usr/local/var/run/openvswitch
472 $ taskset 0x30 qemu-system-x86_64 -cpu host -smp 2,cores=2 -m 4096M \
473 -drive file=$QCOW2_IMAGE --enable-kvm -name $VM_NAME \
474 -nographic -numa node,memdev=mem -mem-prealloc \
475 -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \
476 -chardev socket,id=char1,path=$VHOST_SOCK_DIR/dpdkvhostuser0 \
477 -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=2 \
478 -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mq=on,vectors=6 \
479 -chardev socket,id=char2,path=$VHOST_SOCK_DIR/dpdkvhostuser1 \
480 -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=2 \
481 -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=6
482
483 .. note::
484 The queue value above should match the queues configured in OVS, and the
485 vector value should be set to "number of queues x 2 + 2".
486
487 3. Configure the guest interface
488
489 Assuming there are 2 interfaces in the guest named eth0 and eth1, check the
490 channel configuration and set the number of combined channels to 2 for the
491 virtio devices::
492
493 $ ethtool -l eth0
494 $ ethtool -L eth0 combined 2
495 $ ethtool -L eth1 combined 2
496
497 More information can be found in the vHost Walkthrough section.
498
499 4. Configure kernel packet forwarding
500
501 Configure IP and enable interfaces::
502
503 $ ifconfig eth0 5.5.5.1/24 up
504 $ ifconfig eth1 90.90.90.1/24 up
505
506 Configure IP forwarding and add route entries::
507
508 $ sysctl -w net.ipv4.ip_forward=1
509 $ sysctl -w net.ipv4.conf.all.rp_filter=0
510 $ sysctl -w net.ipv4.conf.eth0.rp_filter=0
511 $ sysctl -w net.ipv4.conf.eth1.rp_filter=0
512 $ ip route add 2.1.1.0/24 dev eth1
513 $ route add default gw 2.1.1.2 eth1
514 $ route add default gw 90.90.90.90 eth1
515 $ arp -s 90.90.90.90 DE:AD:BE:EF:CA:FE
516 $ arp -s 2.1.1.2 DE:AD:BE:EF:CA:FA
517
518 Check traffic on multiple queues::
519
520 $ cat /proc/interrupts | grep virtio
521
522 vHost Walkthrough
523 -----------------
524
525 Two types of vHost User ports are available in OVS:
526
527 - vhost-user (``dpdkvhostuser``)
528
529 - vhost-user-client (``dpdkvhostuserclient``)
530
531 vHost User uses a client-server model. The server creates/manages/destroys the
532 vHost User sockets, and the client connects to the server. Depending on which
533 port type you use, ``dpdkvhostuser`` or ``dpdkvhostuserclient``, a different
534 configuration of the client-server model is used.
535
536 For vhost-user ports, Open vSwitch acts as the server and QEMU the client. For
537 vhost-user-client ports, Open vSwitch acts as the client and QEMU the server.
538
539 vhost-user
540 ~~~~~~~~~~
541
542 1. Install the prerequisites:
543
544 - QEMU version >= 2.2
545
546 2. Add vhost-user ports to the switch.
547
548 Unlike DPDK ring ports, DPDK vhost-user ports can have arbitrary names,
549 except that forward and backward slashes are prohibited in the names.
550
551 For vhost-user, the name of the port type is ``dpdkvhostuser``::
552
553 $ ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1 \
554 type=dpdkvhostuser
555
556 This action creates a socket located at
557 ``/usr/local/var/run/openvswitch/vhost-user-1``, which you must provide to
558 your VM on the QEMU command line. More instructions on this can be found in
559 the next section, "Add vhost-user ports to VM".
560
561 .. note::
562 If you wish for the vhost-user sockets to be created in a sub-directory of
563 ``/usr/local/var/run/openvswitch``, you may specify this directory in the
564 ovsdb like so::
565
566 $ ovs-vsctl --no-wait \
567 set Open_vSwitch . other_config:vhost-sock-dir=subdir
568
569 3. Add vhost-user ports to VM
570
571 1. Configure sockets
572
573 Pass the following parameters to QEMU to attach a vhost-user device::
574
575 -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1
576 -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
577 -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1
578
579 where ``vhost-user-1`` is the name of the vhost-user port added to the
580 switch.
581
582 Repeat the above parameters for multiple devices, changing the chardev
583 ``path`` and ``id`` as necessary. Note that a separate and different
584 chardev ``path`` needs to be specified for each vhost-user device. For
585 example, if you have a second vhost-user port named ``vhost-user-2``, you
586 append your QEMU command line with an additional set of parameters::
587
588 -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
589 -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce
590 -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2
591
592 2. Configure hugepages
593
594 QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports access
595 a virtio-net device's virtual rings and packet buffers mapping the VM's
596 physical memory on hugetlbfs. To enable vhost-user ports to map the VM's
597 memory into their process address space, pass the following parameters
598 to QEMU::
599
600 -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on
601 -numa node,memdev=mem -mem-prealloc
602
603 3. Enable multiqueue support (optional)
604
605 QEMU needs to be configured to use multiqueue::
606
607 -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
608 -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q
609 -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v
610
611 where:
612
613 ``$q``
614 The number of queues
615 ``$v``
616 The number of vectors, which is ``$q`` * 2 + 2
617
618 The vhost-user interface will be automatically reconfigured with the
619 required number of rx and tx queues after connection of the virtio device.
620 Manual configuration of ``n_rxq`` is not supported because OVS will work
621 properly only if ``n_rxq`` matches the number of queues configured in
622 QEMU.
623
624 At least 2 PMDs should be configured for the vswitch when using
625 multiqueue. Using a single PMD will cause traffic to be enqueued to the
626 same vhost queue rather than being distributed among different vhost
627 queues for a vhost-user interface.
628
629 If traffic destined for a VM configured with multiqueue arrives at the
630 vswitch via a physical DPDK port, then the number of rxqs should also be
631 set to at least 2 for that physical DPDK port. This is required to
632 increase the probability that a different PMD will handle the multiqueue
633 transmission to the guest using a different vhost queue.
634
635 If one wishes to use multiple queues for an interface in the guest, the
636 driver in the guest operating system must be configured to do so. It is
637 recommended that the number of queues configured be equal to ``$q``.
638
639 For example, this can be done for the Linux kernel virtio-net driver
640 with::
641
642 $ ethtool -L <DEV> combined <$q>
643
644 where:
645
646 ``-L``
647 Changes the numbers of channels of the specified network device
648 ``combined``
649 Changes the number of multi-purpose channels.
650
651 Configure the VM using libvirt
652 ++++++++++++++++++++++++++++++
653
654 You can also build and configure the VM using libvirt rather than QEMU by
655 itself.
656
657 1. Change the user/group, access control policy and restart libvirtd.
658
659 - In ``/etc/libvirt/qemu.conf`` add/edit the following lines::
660
661 user = "root"
662 group = "root"
663
664 - Disable SELinux or set to permissive mode::
665
666 $ setenforce 0
667
668 - Restart the libvirtd process. For example, on Fedora::
669
670 $ systemctl restart libvirtd.service
671
672 2. Instantiate the VM
673
674 - Copy the XML configuration described in the `DPDK installation guide`_.
675
676 - Start the VM::
677
678 $ virsh create demovm.xml
679
680 - Connect to the guest console::
681
682 $ virsh console demovm
683
684 3. Configure the VM
685
686 The demovm xml configuration is aimed at achieving out-of-the-box performance
687 on the VM.
688
689 - The vcpus are pinned to the cores of CPU socket 0 using ``vcpupin``.
690
691 - Configure the NUMA cell and shared memory using ``memAccess='shared'``.
692
693 - Disable mergeable rx buffers with ``mrg_rxbuf='off'``.
694
695 Refer to the `libvirt documentation <http://libvirt.org/formatdomain.html>`__
696 for more information.
697
698 vhost-user-client
699 ~~~~~~~~~~~~~~~~~
700
701 1. Install the prerequisites:
702
703 - QEMU version >= 2.7
704
705 2. Add vhost-user-client ports to the switch.
706
707 Unlike vhost-user ports, the name given to the port does not govern the name of
708 the socket device. ``vhost-server-path`` reflects the full path of the
709 socket that has been or will be created by QEMU for the given vHost User
710 client port.
711
712 For vhost-user-client, the name of the port type is
713 ``dpdkvhostuserclient``::
714
715 $ VHOST_USER_SOCKET_PATH=/path/to/socket
716 $ ovs-vsctl add-port br0 vhost-client-1 \
717 -- set Interface vhost-client-1 type=dpdkvhostuserclient \
718 options:vhost-server-path=$VHOST_USER_SOCKET_PATH
719
720 3. Add vhost-user-client ports to VM
721
722 1. Configure sockets
723
724 Pass the following parameters to QEMU to attach a vhost-user device::
725
726 -chardev socket,id=char1,path=$VHOST_USER_SOCKET_PATH,server
727 -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
728 -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1
729
730 where ``$VHOST_USER_SOCKET_PATH`` is the same socket path configured as
731 ``vhost-server-path`` on the vhost-user-client port added to the switch.
732
733 If the corresponding dpdkvhostuserclient port has not yet been configured
734 in OVS with ``vhost-server-path=/path/to/socket``, QEMU will print a log
735 similar to the following::
736
737 QEMU waiting for connection on: disconnected:unix:/path/to/socket,server
738
739 QEMU will wait until the port is created successfully in OVS to boot the VM.
740
741 One benefit of using this mode is the ability for vHost ports to
742 'reconnect' in the event of the switch crashing or being brought down. Once
743 it is brought back up, the vHost ports will reconnect automatically and
744 normal service will resume.
745
746 DPDK Backend Inside VM
747 ~~~~~~~~~~~~~~~~~~~~~~
748
749 Additional configuration is required if you want to run ovs-vswitchd with the
750 DPDK backend inside a QEMU virtual machine. ovs-vswitchd creates separate DPDK
751 TX queues for each available CPU core. This operation fails inside a QEMU
752 virtual machine because, by default, the VirtIO NIC provided to the guest is
753 configured to support only a single TX queue and a single RX queue. To change
754 this behavior, you need to turn on the ``mq`` (multiqueue) property of all
755 ``virtio-net-pci`` devices emulated by QEMU and used by DPDK. You may do this
756 manually (by changing the QEMU command line) or, if you use Libvirt, by adding
757 the following string to the ``<interface>`` sections of all network devices used by DPDK::
758
759 <driver name='vhost' queues='N'/>
760
761 Where:
762
763 ``N``
764 determines how many queues can be used by the guest.
765
766 This requires QEMU >= 2.2.
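
If the guest's network devices are configured manually on the QEMU command line
instead of through libvirt, the rough equivalent (assuming a tap backend
accelerated by the kernel's vhost-net module; names are illustrative) is to
request multiple queues from the backend and enable ``mq`` on the virtio
device::

    -netdev tap,id=net0,vhost=on,queues=$q
    -device virtio-net-pci,netdev=net0,mq=on,vectors=$v

where ``$q`` is the number of queues and ``$v`` is ``$q * 2 + 2``, as in the
multiqueue example earlier in this guide.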
767
768 QoS
769 ---
770
771 Assuming you have a vhost-user port transmitting traffic consisting of packets
772 of size 64 bytes, the following command would limit the egress transmission
773 rate of the port to ~1,000,000 packets per second::
774
775 $ ovs-vsctl set port vhost-user0 qos=@newqos -- \
776 --id=@newqos create qos type=egress-policer other-config:cir=46000000 \
777 other-config:cbs=2048
778
779 To examine the QoS configuration of the port, run::
780
781 $ ovs-appctl -t ovs-vswitchd qos/show vhost-user0
782
783 To clear the QoS configuration from the port and ovsdb, run::
784
785 $ ovs-vsctl destroy QoS vhost-user0 -- clear Port vhost-user0 qos
786
787 Refer to vswitch.xml for more details on egress-policer.
788
789 Rate Limiting
790 --------------
791
792 Here is an example of ingress policing usage. Assuming you have a vhost-user
793 port receiving traffic consisting of packets of size 64 bytes, the following
794 command would limit the reception rate of the port to ~1,000,000 packets per
795 second::
796
797 $ ovs-vsctl set interface vhost-user0 ingress_policing_rate=368000 \
798 ingress_policing_burst=1000
799
800 To examine the ingress policer configuration of the port::
801
802 $ ovs-vsctl list interface vhost-user0
803
804 To clear the ingress policer configuration from the port::
805
806 $ ovs-vsctl set interface vhost-user0 ingress_policing_rate=0
807
808 Refer to vswitch.xml for more details on ingress-policer.
809
810 Flow Control
811 ------------
812
813 Flow control can be enabled only on DPDK physical ports. To enable flow
814 control support at the tx side while adding a port, run::
815
816 $ ovs-vsctl add-port br0 dpdk0 -- \
817 set Interface dpdk0 type=dpdk options:tx-flow-ctrl=true
818
819 Similarly, to enable rx flow control, run::
820
821 $ ovs-vsctl add-port br0 dpdk0 -- \
822 set Interface dpdk0 type=dpdk options:rx-flow-ctrl=true
823
824 To enable flow control auto-negotiation, run::
825
826 $ ovs-vsctl add-port br0 dpdk0 -- \
827 set Interface dpdk0 type=dpdk options:flow-ctrl-autoneg=true
828
829 To turn on tx flow control at run time (after the port has been added to
830 OVS)::
831
832 $ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=true
833
834 The flow control parameters can be turned off by setting the respective
835 parameter to ``false``. To disable flow control at the tx side, run::
836
837 $ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false
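
To check which flow control options are currently configured on a port, inspect
the interface record; for example::

    $ ovs-vsctl list Interface dpdk0 | grep options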
838
839 pdump
840 -----
841
842 Pdump allows you to listen on DPDK ports and view the traffic that is passing
843 on them. To use this utility, one must have libpcap installed on the system.
844 Furthermore, DPDK must be built with ``CONFIG_RTE_LIBRTE_PDUMP=y`` and
845 ``CONFIG_RTE_LIBRTE_PMD_PCAP=y``.
846
847 .. warning::
848 A performance decrease is expected when using a monitoring application like
849 the DPDK pdump app.
850
851 To use pdump, simply launch OVS as usual. Then, navigate to the ``app/pdump``
852 directory in DPDK, ``make`` the application and run it like so::
853
854 $ sudo ./build/app/dpdk-pdump -- \
855 --pdump port=0,queue=0,rx-dev=/tmp/pkts.pcap \
856 --server-socket-path=/usr/local/var/run/openvswitch
857
858 The above command captures traffic received on queue 0 of port 0 and stores it
859 in ``/tmp/pkts.pcap``. Other combinations of port numbers, queue numbers and
860 pcap locations are of course also available to use. For example, to capture all
861 packets that traverse port 0 in a single pcap file::
862
863 $ sudo ./build/app/dpdk-pdump -- \
864 --pdump 'port=0,queue=*,rx-dev=/tmp/pkts.pcap,tx-dev=/tmp/pkts.pcap' \
865 --server-socket-path=/usr/local/var/run/openvswitch
866
867 ``server-socket-path`` must be set to the value of ovs_rundir() which typically
868 resolves to ``/usr/local/var/run/openvswitch``.
869
870 Many tools are available to view the contents of the pcap file. One example is
871 tcpdump. Issue the following command to view the contents of ``pkts.pcap``::
872
873 $ tcpdump -r pkts.pcap
874
875 More information on the pdump app and its usage can be found in the `DPDK docs
876 <http://dpdk.org/doc/guides/sample_app_ug/pdump.html>`__.
877
878 Jumbo Frames
879 ------------
880
881 By default, DPDK ports are configured with standard Ethernet MTU (1500B). To
882 enable Jumbo Frames support for a DPDK port, change the Interface's
883 ``mtu_request`` attribute to a sufficiently large value. For example, to add a
884 DPDK Phy port with MTU of 9000::
885
886 $ ovs-vsctl add-port br0 dpdk0 \
887 -- set Interface dpdk0 type=dpdk \
888 -- set Interface dpdk0 mtu_request=9000
889
890 Similarly, to change the MTU of an existing port to 6200::
891
892 $ ovs-vsctl set Interface dpdk0 mtu_request=6200
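
To confirm that the requested MTU has been applied to the port, the interface's
``mtu`` column can be read back; for example::

    $ ovs-vsctl get Interface dpdk0 mtu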
893
894 Some additional configuration is needed to take advantage of jumbo frames with
895 vHost ports:
896
897 1. *mergeable buffers* must be enabled for vHost ports, as demonstrated in the
898 QEMU command line snippet below::
899
900 -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
901 -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on
902
903 2. Where virtio devices are bound to the Linux kernel driver in a guest
904 environment (i.e. interfaces are not bound to an in-guest DPDK driver), the
905 MTU of those logical network interfaces must also be increased to a
906 sufficiently large value. This avoids segmentation of Jumbo Frames received
907 in the guest. Note that 'MTU' refers to the length of the IP packet only,
908 and not that of the entire frame.
909
910 To calculate the exact MTU of a standard IPv4 frame, subtract the L2 header
911 and CRC lengths (i.e. 18B) from the max supported frame size. So, to set
912 the MTU for a 9018B Jumbo Frame::
913
914 $ ifconfig eth1 mtu 9000
915
916 When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments is
917 increased, such that a full Jumbo Frame of a specific size may be accommodated
918 within a single mbuf segment.
919
920 Jumbo frame support has been validated against 9728B frames, which is the
921 largest frame size supported by the Fortville NIC using the DPDK i40e driver,
922 but larger frames and other DPDK NIC drivers may be supported. These cases are
923 common for use cases involving East-West traffic only.
924
925 vsperf
926 ------
927
928 The vsperf project aims to develop a vSwitch test framework that can be used to
929 validate the suitability of different vSwitch implementations in a telco
930 deployment environment. More information can be found on the `OPNFV wiki
931 <https://wiki.opnfv.org/display/vsperf/VSperf+Home>`__.
932
933 Bug Reporting
934 -------------
935
936 Report problems to bugs@openvswitch.org.
937
938 .. _DPDK installation guide: INSTALL.DPDK.rst