..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      =======  Heading 0 (reserved for the title in a document)
      -------  Heading 1
      ~~~~~~~  Heading 2
      +++++++  Heading 3
      '''''''  Heading 4

      Avoid deeper levels because they do not render well.

======================
Open vSwitch with DPDK
======================

This document describes how to build and install Open vSwitch using a DPDK
datapath. Open vSwitch can use the DPDK library to operate entirely in
userspace.

The DPDK support of Open vSwitch is considered 'experimental'.

In addition to the requirements described in :doc:`general`, building Open
vSwitch with DPDK will require the following:

- A `DPDK supported NIC`_

  Only required when physical ports are in use

On Linux distros running kernel version >= 3.0, only ``IOMMU`` needs to be
enabled via the grub cmdline, assuming you are using **VFIO**. For older
kernels, ensure the kernel is built with ``UIO``, ``HUGETLBFS``,
``PROC_PAGE_MONITOR``, ``HPET``, and ``HPET_MMAP`` support. If these are not
present, it will be necessary to upgrade your kernel or build a custom kernel
with these flags enabled.

Detailed system requirements can be found at `DPDK requirements`_.

.. _DPDK supported NIC: http://dpdk.org/doc/nics
.. _DPDK requirements: http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html

#. Download the `DPDK sources`_, extract the file and set ``DPDK_DIR``::

       $ cd /usr/src/
       $ wget http://fast.dpdk.org/rel/dpdk-16.11.tar.xz
       $ tar xf dpdk-16.11.tar.xz
       $ export DPDK_DIR=/usr/src/dpdk-16.11

#. (Optional) Configure DPDK as a shared library

   DPDK can be built as either a static library or a shared library. By
   default, it is configured for the former. If you wish to use the latter, set
   ``CONFIG_RTE_BUILD_SHARED_LIB=y`` in ``$DPDK_DIR/config/common_base``.

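   For example, assuming the stock configuration still has this option set to
   ``n``, the change could be made with a one-line edit::

       $ sed -i 's/CONFIG_RTE_BUILD_SHARED_LIB=n/CONFIG_RTE_BUILD_SHARED_LIB=y/' \
           $DPDK_DIR/config/common_base
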
   Minor performance loss is expected when using OVS with a shared DPDK
   library compared to a static DPDK library.

#. Configure and install DPDK

   Build and install the DPDK library::

       $ export DPDK_TARGET=x86_64-native-linuxapp-gcc
       $ export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
       $ make install T=$DPDK_TARGET DESTDIR=install

   If IVSHMEM support is required, use a different target::

       $ export DPDK_TARGET=x86_64-ivshmem-linuxapp-gcc

#. (Optional) Export the DPDK shared library location

   If DPDK was built as a shared library, export the path to this library for
   use when building OVS::

       $ export LD_LIBRARY_PATH=$DPDK_DIR/x86_64-native-linuxapp-gcc/lib

.. _DPDK sources: http://dpdk.org/rel

OVS can be installed using different methods. For OVS to use the DPDK
datapath, it has to be configured with DPDK support (``--with-dpdk``).

This section focuses on a generic recipe that suits most cases. For
distribution-specific instructions, refer to one of the more relevant guides.

.. _OVS sources: http://openvswitch.org/releases/

#. Ensure the standard OVS requirements, described in
   :ref:`general-build-reqs`, are installed

#. Bootstrap, if required, as described in :ref:`general-bootstrapping`

#. Configure the package using the ``--with-dpdk`` flag::

       $ ./configure --with-dpdk=$DPDK_BUILD

   where ``DPDK_BUILD`` is the path to the built DPDK library. This can be
   skipped if the DPDK library is installed in its default location.

   While ``--with-dpdk`` is required, you can pass any other configuration
   option described in :ref:`general-configuring`.

#. Build and install OVS, as described in :ref:`general-building`

Additional information can be found in :doc:`general`.

Allocate a number of 2M Huge pages:

- For persistent allocation of huge pages, write to a hugepages.conf file in
  ``/etc/sysctl.d``::

      $ echo 'vm.nr_hugepages=2048' > /etc/sysctl.d/hugepages.conf

  This setting is applied at boot time; see the example after this list for
  applying it immediately.

- For run-time allocation of huge pages, use the ``sysctl`` utility::

      $ sysctl -w vm.nr_hugepages=N # where N = No. of 2M huge pages

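As a sketch of how to apply the persistent setting above without a reboot, the
``sysctl`` utility can load the file directly::

    $ sysctl -p /etc/sysctl.d/hugepages.conf
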
To verify hugepage configuration::

    $ grep HugePages_ /proc/meminfo

Mount the hugepages, if not already mounted by default::

    $ mount -t hugetlbfs none /dev/hugepages

Setup DPDK devices using VFIO
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

VFIO is preferred to the UIO driver when using recent versions of DPDK. VFIO
requires support from both the kernel and BIOS. For the former, kernel version
> 3.6 must be used. For the latter, you must enable VT-d in the BIOS and
ensure this is configured via grub. To ensure VT-d is enabled via the BIOS,
run::

    $ dmesg | grep -e DMAR -e IOMMU

If VT-d is not enabled in the BIOS, enable it now.

To ensure VT-d is enabled in the kernel, run::

    $ cat /proc/cmdline | grep iommu=pt
    $ cat /proc/cmdline | grep intel_iommu=on

If VT-d is not enabled in the kernel, enable it now.

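As a sketch of one way to enable it on a GRUB-based system (the exact file and
the command used to regenerate the configuration vary by distribution)::

    $ # append to GRUB_CMDLINE_LINUX in /etc/default/grub:
    $ #   GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt"
    $ grub2-mkconfig -o /boot/grub2/grub.cfg   # or `update-grub` on Debian/Ubuntu
    $ reboot
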
Once VT-d is correctly configured, load the required modules and bind the NIC
to the ``vfio-pci`` driver::

    $ modprobe vfio-pci
    $ /usr/bin/chmod a+x /dev/vfio
    $ /usr/bin/chmod 0666 /dev/vfio/*
    $ $DPDK_DIR/tools/dpdk-devbind.py --bind=vfio-pci eth1
    $ $DPDK_DIR/tools/dpdk-devbind.py --status

Open vSwitch should be started as described in :doc:`general` with the
exception of ovs-vswitchd, which requires some special configuration to enable
DPDK functionality. DPDK configuration arguments can be passed to ovs-vswitchd
via the ``other_config`` column of the ``Open_vSwitch`` table. At a minimum,
the ``dpdk-init`` option must be set to ``true``. For example::

    $ export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
    $ ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
    $ ovs-vswitchd unix:$DB_SOCK --pidfile --detach

There are many other configuration options, the most important of which are
listed below. Defaults will be provided for all values not explicitly set.

``dpdk-init``
  Specifies whether OVS should initialize and support DPDK ports. This is a
  boolean, and defaults to false.

``dpdk-lcore-mask``
  Specifies the CPU cores on which dpdk lcore threads should be spawned and
  expects a hex string (e.g. '0x123').

``dpdk-socket-mem``
  Comma separated list of memory to pre-allocate from hugepages on specific
  sockets.

``dpdk-hugepage-dir``
  Directory where hugetlbfs is mounted

``vhost-sock-dir``
  Option to set the path to the vhost-user unix socket files.

If allocating more than one GB hugepage (as for IVSHMEM), you can configure the
amount of memory used from any given NUMA node. For example, to use 1GB from
NUMA node 0, run::

    $ ovs-vsctl --no-wait set Open_vSwitch . \
        other_config:dpdk-socket-mem="1024,0"

Similarly, if you wish to better scale the workloads across cores, then
multiple pmd threads can be created and pinned to CPU cores by explicitly
specifying ``pmd-cpu-mask``. Cores are numbered from 0, so to spawn two pmd
threads and pin them to cores 1,2, run::

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x6

For details on using IVSHMEM with DPDK, refer to :doc:`/topics/dpdk/ivshmem`.

Refer to ovs-vswitchd.conf.db(5) for additional information on configuration
options.

Changing any of these options requires restarting the ovs-vswitchd daemon.

At this point you can use ovs-vsctl to set up bridges and other Open vSwitch
features. Seeing as we've configured the DPDK datapath, we will use DPDK-type
ports. For example, to create a userspace bridge named ``br0`` and add two
``dpdk`` ports to it, run::

    $ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
    $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
    $ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk

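Other userspace port types are added in the same way. For instance, a
``dpdkvhostuser`` port for VM traffic could be added as follows (the port name
``vhost-user0`` is purely illustrative)::

    $ ovs-vsctl add-port br0 vhost-user0 -- set Interface vhost-user0 \
        type=dpdkvhostuser
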
Refer to ovs-vsctl(8) and :doc:`/howto/dpdk` for more details.

To achieve optimal OVS performance, the system can be configured accordingly;
this includes BIOS tweaks, grub cmdline additions, a better understanding of
NUMA nodes, and careful selection of PCIe slots for NIC placement.

This section is optional. Once installed as described above, OVS with DPDK
will work out of the box.

Recommended BIOS Settings
~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Recommended BIOS Settings
   :header-rows: 1

   * - Setting
     - Value
   * - MLC Spatial Prefetcher
     - Enabled
   * - DCU Data Prefetcher
     - Enabled
   * - CPU Power and Performance
     - Performance
   * - Memory RAS and Performance Config -> NUMA optimized
     - Enabled

The fastpath performance can be affected by factors related to the placement of
the NIC, such as channel speeds between PCIe slot and CPU or the proximity of
PCIe slot to the CPU cores running the DPDK application. Listed below are the
steps to identify the right PCIe slot.

#. Retrieve host details using ``dmidecode``. For example::

       $ dmidecode -t baseboard | grep "Product Name"

#. Download the technical specification for the product listed, e.g. S2600WT2

#. Check the Product Architecture Overview for the riser slot placement, CPU
   sharing info and also PCIe channel speeds

   For example: On S2600WT, CPU1 and CPU2 share Riser Slot 1, with the channel
   speed between CPU1 and Riser Slot 1 at 32GB/s and between CPU2 and Riser
   Slot 1 at 16GB/s. Running the DPDK app on CPU1 cores with the NIC inserted
   into Riser Slot 1 will optimize OVS performance in this case.

#. Check the Riser Card #1 - Root Port mapping information for the available
   slots and individual bus speeds. On S2600WT, slots 1 and 2 have high bus
   speeds and are potential slots for NIC placement.

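It also helps to confirm which NUMA node a candidate slot is attached to, so
that pmd cores can later be chosen on the same node. A sketch, using a
hypothetical PCI address of ``0000:05:00.0``::

    $ lspci | grep -i ethernet                      # find the NIC's PCI address
    $ cat /sys/bus/pci/devices/0000:05:00.0/numa_node
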
Advanced Hugepage Setup
~~~~~~~~~~~~~~~~~~~~~~~

Allocate and mount 1 GB hugepages.

- For persistent allocation of huge pages, add the following options to the
  kernel command line::

      default_hugepagesz=1GB hugepagesz=1G hugepages=N

  For platforms supporting multiple huge page sizes, add multiple options::

      default_hugepagesz=<size> hugepagesz=<size> hugepages=N

  where:

  ``N``
    number of huge pages requested
  ``size``
    huge page size with an optional suffix ``[kKmMgG]``

- For run-time allocation of huge pages::

      $ echo N > /sys/devices/system/node/nodeX/hugepages/hugepages-1048576kB/nr_hugepages

  where:

  ``N``
    number of huge pages requested
  ``X``
    NUMA node

For run-time allocation of 1G huge pages, Contiguous Memory Allocator
(``CONFIG_CMA``) has to be supported by the kernel; check your Linux distro.

Now mount the huge pages, if this has not already been done::

    $ mount -t hugetlbfs -o pagesize=1G none /dev/hugepages

Enable HyperThreading
~~~~~~~~~~~~~~~~~~~~~

With HyperThreading, or SMT, enabled, a physical core appears as two logical
cores. SMT can be utilized to spawn worker threads on logical cores of the same
physical core, thereby saving additional cores.

With DPDK, when pinning pmd threads to logical cores, care must be taken to set
the correct bits of the ``pmd-cpu-mask`` to ensure that the pmd threads are
pinned to SMT siblings.

Take a sample system configuration with 2 sockets, 2 * 10 core processors, and
HT enabled. This gives us a total of 40 logical cores. To identify the physical
core shared by two logical cores, run::

    $ cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list

where ``N`` is the logical core number.

In this example, it would show that cores ``1`` and ``21`` share the same
physical core. As cores are counted from 0, the ``pmd-cpu-mask`` can be used
to enable these two pmd threads running on these two logical cores (one
physical core)::

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x200002

The ``isolcpus`` option can be used to isolate cores from the Linux scheduler.
The isolated cores can then be dedicated to running HPC applications or
threads. This helps achieve better application performance due to zero context
switching and minimal cache thrashing. To run platform logic on core 0 and
isolate cores 1 through 19 from the scheduler, add ``isolcpus=1-19`` to the
GRUB cmdline.

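As with the IOMMU options above, this is only a sketch for a GRUB-based system;
the exact file and regeneration command vary by distribution::

    $ # append to GRUB_CMDLINE_LINUX in /etc/default/grub:
    $ #   GRUB_CMDLINE_LINUX="... isolcpus=1-19"
    $ grub2-mkconfig -o /boot/grub2/grub.cfg   # or `update-grub` on Debian/Ubuntu
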
It has been verified that, in some circumstances, core isolation has minimal
advantage due to the maturity of the Linux scheduler.

Ideally, inter-NUMA datapaths should be avoided where possible, as packets will
go across QPI and there may be a slight performance penalty when compared with
intra-NUMA datapaths. On Intel Xeon Processor E5 v3, Cluster On Die is
introduced on models that have 10 cores or more. This makes it possible to
logically split a socket into two NUMA regions; again, it is preferred where
possible to keep critical datapaths within one cluster.

It is good practice to ensure that threads that are in the datapath are pinned
to cores in the same NUMA area, e.g. pmd threads and QEMU vCPUs responsible for
forwarding. If DPDK is built with ``CONFIG_RTE_LIBRTE_VHOST_NUMA=y``, vHost
User ports automatically detect the NUMA socket of the QEMU vCPUs and will be
serviced by a PMD from the same node provided a core on this node is enabled in
the ``pmd-cpu-mask``. ``libnuma`` packages are required for this feature.

Compiler Optimizations
~~~~~~~~~~~~~~~~~~~~~~

The default compiler optimization level is ``-O2``. Changing this to a more
aggressive compiler optimization such as ``-O3 -march=native`` with
gcc (verified on 5.3.1) can produce performance gains, though not significant.
``-march=native`` will produce code optimized for the local machine and should
only be used when the software is compiled on the testbed itself.

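For example, assuming ``DPDK_BUILD`` is set as in the installation steps above,
these flags could be passed at configure time::

    $ ./configure --with-dpdk=$DPDK_BUILD CFLAGS="-O3 -march=native"
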
For superior performance, DPDK pmd threads and QEMU vCPU threads need to be
affinitized accordingly.

- PMD thread Affinity

  A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
  assigned to it. A pmd thread shall poll the ports for incoming packets,
  switch the packets and send them to a tx port. A pmd thread is CPU bound, and
  needs to be affinitized to isolated cores for optimum performance.

  By setting a bit in the mask, a pmd thread is created and pinned to the
  corresponding CPU core. e.g. to run a pmd thread on core 2::

      $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4

  A pmd thread on a NUMA node is only created if there is at least one DPDK
  interface from that NUMA node added to OVS.

- QEMU vCPU thread Affinity

  A VM performing simple packet forwarding or running complex packet pipelines
  has to ensure that the vCPU threads performing the work have as much CPU
  occupancy as possible.

  For example, on a multicore VM, multiple QEMU vCPU threads shall be spawned.
  When the DPDK ``testpmd`` application that does packet forwarding is invoked,
  the ``taskset`` command should be used to affinitize the vCPU threads to the
  dedicated isolated cores on the host system, as shown after this list.

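A minimal sketch of such pinning, assuming cores 4 and 5 are isolated on the
host and using a purely illustrative thread ID::

    $ taskset -pc 4 12345                       # pin an already-running vCPU thread (TID 12345)
    $ taskset -c 4,5 qemu-system-x86_64 ...     # or start QEMU restricted to those cores
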
Multiple Poll-Mode Driver Threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With pmd multi-threading support, OVS creates one pmd thread for each NUMA node
by default. However, in cases where there are multiple ports/rxq's producing
traffic, performance can be improved by creating multiple pmd threads running
on separate cores. These pmd threads can share the workload by each being
responsible for different ports/rxq's. Assignment of ports/rxq's to pmd threads
is done automatically.

A set bit in the mask means a pmd thread is created and pinned to the
corresponding CPU core. For example, to run pmd threads on cores 1 and 2::

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x6

When using dpdk and dpdkvhostuser ports in a bi-directional VM loopback as
shown below, spreading the workload over 2 or 4 pmd threads shows significant
improvements as there will be more total CPU occupancy available::

    NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1

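As an illustration, spreading that workload over four pmd threads on cores 1
through 4 (chosen here only as an example; pick cores local to the NIC's NUMA
node) could be done with::

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1E
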
DPDK Physical Port Rx Queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    $ ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>

The above command sets the number of rx queues for the DPDK physical interface.
The rx queues are assigned to pmd threads on the same NUMA node in a
round-robin fashion.

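For instance, to give the ``dpdk0`` port created earlier two rx queues::

    $ ovs-vsctl set Interface dpdk0 options:n_rxq=2
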
DPDK Physical Port Queue Sizes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    $ ovs-vsctl set Interface dpdk0 options:n_rxq_desc=<integer>
    $ ovs-vsctl set Interface dpdk0 options:n_txq_desc=<integer>

The above commands set the number of rx/tx descriptors that the NIC associated
with dpdk0 will be initialised with.

Different ``n_rxq_desc`` and ``n_txq_desc`` configurations yield different
benefits in terms of throughput and latency for different scenarios.
Generally, smaller queue sizes can have a positive impact on latency at the
expense of throughput. The opposite is often true for larger queue sizes.
Note: increasing the number of rx descriptors, e.g. to 4096, may have a
negative impact on performance due to the fact that non-vectorised DPDK rx
functions may be used. This is dependent on the driver in use, but is true for
the commonly used i40e and ixgbe DPDK drivers.

Each pmd thread contains one Exact Match Cache (EMC). After initial flow setup
in the datapath, the EMC contains a single table and provides the lowest level
(fastest) switching for DPDK ports. If there is a miss in the EMC then the next
level where switching will occur is the datapath classifier. Missing in the
EMC and looking up in the datapath classifier incurs a significant performance
penalty. If lookup misses occur in the EMC because it is too small to handle
the number of flows, its size can be increased. The EMC size can be modified by
editing the define ``EM_FLOW_HASH_SHIFT`` in ``lib/dpif-netdev.c``.

As mentioned above, an EMC is per pmd thread. An alternative way of increasing
the aggregate amount of possible flow entries in EMC and avoiding datapath
classifier lookups is to have multiple pmd threads running.

Rx mergeable buffers is a virtio feature that allows chaining of multiple
virtio descriptors to handle large packet sizes. Large packets are handled by
reserving and chaining multiple free descriptors together. Mergeable buffer
support is negotiated between the virtio driver and virtio device and is
supported by the DPDK vhost library. This behavior is supported and enabled by
default; however, if the user knows that rx mergeable buffers are not needed,
i.e. jumbo frames are not needed, it can be forced off by adding
``mrg_rxbuf=off`` to the QEMU command line options, as shown below. By not
reserving multiple chains of descriptors, more individual virtio descriptors
are made available for rx to the guest using dpdkvhost ports, and this can
improve performance.

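A sketch of the relevant QEMU arguments, assuming a ``dpdkvhostuser`` port
named ``vhost-user0`` and the default socket directory used elsewhere in this
document::

    -chardev socket,id=char0,path=/usr/local/var/run/openvswitch/vhost-user0
    -netdev type=vhost-user,id=net0,chardev=char0,vhostforce
    -device virtio-net-pci,netdev=net0,mrg_rxbuf=off
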
- Currently DPDK ports do not use HW offload functionality.
- Network Interface Firmware requirements: Each release of DPDK is
  validated against a specific firmware version for a supported Network
  Interface. New firmware versions introduce bug fixes, performance
  improvements and new functionality that DPDK leverages. The validated
  firmware versions are available as part of the release notes for
  DPDK. It is recommended that users update Network Interface firmware
  to match what has been validated for the DPDK release.
  The latest list of validated firmware versions can be found in the `DPDK
  release notes`_.

.. _DPDK release notes: http://dpdk.org/doc/guides/rel_notes/release_16_11.html

Report problems to bugs@openvswitch.org.