..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      =======  Heading 0 (reserved for the title in a document)
      -------  Heading 1
      ~~~~~~~  Heading 2
      +++++++  Heading 3
      '''''''  Heading 4

      Avoid deeper levels because they do not render well.

======================
Open vSwitch with DPDK
======================

This document describes how to build and install Open vSwitch using a DPDK
datapath. Open vSwitch can use the DPDK library to operate entirely in
userspace.

.. warning::
   The DPDK support of Open vSwitch is considered 'experimental'.

Build requirements
------------------

In addition to the requirements described in :doc:`general`, building Open
vSwitch with DPDK will require the following:

- DPDK 16.11

- A `DPDK supported NIC`_

  Only required when physical ports are in use.

- A suitable kernel

  On Linux distributions running kernel version >= 3.0, only ``IOMMU`` needs
  to be enabled via the GRUB command line, assuming you are using **VFIO**.
  For older kernels, ensure the kernel is built with ``UIO``, ``HUGETLBFS``,
  ``PROC_PAGE_MONITOR``, ``HPET``, and ``HPET_MMAP`` support. If these are not
  present, it will be necessary to upgrade your kernel or build a custom
  kernel with these flags enabled.

Detailed system requirements can be found at `DPDK requirements`_.

.. _DPDK supported NIC: http://dpdk.org/doc/nics
.. _DPDK requirements: http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html

Installing
----------

Install DPDK
~~~~~~~~~~~~

#. Download the `DPDK sources`_, extract the file and set ``DPDK_DIR``::

       $ cd /usr/src/
       $ wget http://fast.dpdk.org/rel/dpdk-16.11.tar.xz
       $ tar xf dpdk-16.11.tar.xz
       $ export DPDK_DIR=/usr/src/dpdk-16.11
       $ cd $DPDK_DIR

#. (Optional) Configure DPDK as a shared library

   DPDK can be built as either a static library or a shared library. By
   default, it is configured for the former. If you wish to use the latter,
   set ``CONFIG_RTE_BUILD_SHARED_LIB=y`` in ``$DPDK_DIR/config/common_base``.

   .. note::

      Minor performance loss is expected when using OVS with a shared DPDK
      library compared to a static DPDK library.

#. Configure and install DPDK

   Build and install the DPDK library::

       $ export DPDK_TARGET=x86_64-native-linuxapp-gcc
       $ export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
       $ make install T=$DPDK_TARGET DESTDIR=install

   If IVSHMEM support is required, use a different target::

       $ export DPDK_TARGET=x86_64-ivshmem-linuxapp-gcc

#. (Optional) Export the DPDK shared library location

   If DPDK was built as a shared library, export the path to this library for
   use when building OVS::

       $ export LD_LIBRARY_PATH=$DPDK_DIR/x86_64-native-linuxapp-gcc/lib

.. _DPDK sources: http://dpdk.org/rel

Install OVS
~~~~~~~~~~~

OVS can be installed using different methods. For OVS to use the DPDK
datapath, it has to be configured with DPDK support (``--with-dpdk``).

.. note::
   This section focuses on a generic recipe that suits most cases. For
   distribution-specific instructions, refer to one of the more relevant
   guides.

.. _OVS sources: http://openvswitch.org/releases/

#. Ensure the standard OVS requirements, described in
   :ref:`general-build-reqs`, are installed

#. Bootstrap, if required, as described in :ref:`general-bootstrapping`

#. Configure the package using the ``--with-dpdk`` flag::

       $ ./configure --with-dpdk=$DPDK_BUILD

   where ``DPDK_BUILD`` is the path to the built DPDK library. This can be
   skipped if the DPDK library is installed in its default location.

   .. note::
      While ``--with-dpdk`` is required, you can pass any other configuration
      option described in :ref:`general-configuring`.

#. Build and install OVS, as described in :ref:`general-building`

Additional information can be found in :doc:`general`.

Setup
-----

Setup Hugepages
~~~~~~~~~~~~~~~

Allocate a number of 2 MB huge pages:

- For persistent allocation of huge pages, write to the ``hugepages.conf``
  file in ``/etc/sysctl.d``::

      $ echo 'vm.nr_hugepages=2048' > /etc/sysctl.d/hugepages.conf

- For run-time allocation of huge pages, use the ``sysctl`` utility::

      $ sysctl -w vm.nr_hugepages=N  # where N = No. of 2 MB huge pages

To verify hugepage configuration::

    $ grep HugePages_ /proc/meminfo

Mount the hugepages, if not already mounted by default::

    $ mount -t hugetlbfs none /dev/hugepages

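If you also want the hugetlbfs mount to persist across reboots, one option is
an ``/etc/fstab`` entry. This is a minimal sketch, not taken from the OVS
documentation, and assumes the default 2 MB page size::

    # Illustrative /etc/fstab entry; adjust options to your environment
    none /dev/hugepages hugetlbfs defaults 0 0
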
.. _dpdk-vfio:

Setup DPDK devices using VFIO
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

VFIO is preferred to the UIO driver when using recent versions of DPDK. VFIO
requires support from both the kernel and BIOS. For the former, kernel version
> 3.6 must be used. For the latter, you must enable VT-d in the BIOS and
ensure this is configured via GRUB. To ensure VT-d is enabled via the BIOS,
run::

    $ dmesg | grep -e DMAR -e IOMMU

If VT-d is not enabled in the BIOS, enable it now.

To ensure VT-d is enabled in the kernel, run::

    $ cat /proc/cmdline | grep iommu=pt
    $ cat /proc/cmdline | grep intel_iommu=on

If VT-d is not enabled in the kernel, enable it now.

Once VT-d is correctly configured, load the required modules and bind the NIC
to the VFIO driver::

    $ modprobe vfio-pci
    $ /usr/bin/chmod a+x /dev/vfio
    $ /usr/bin/chmod 0666 /dev/vfio/*
    $ $DPDK_DIR/tools/dpdk-devbind.py --bind=vfio-pci eth1
    $ $DPDK_DIR/tools/dpdk-devbind.py --status

Setup OVS
~~~~~~~~~

Open vSwitch should be started as described in :doc:`general` with the
exception of ovs-vswitchd, which requires some special configuration to enable
DPDK functionality. DPDK configuration arguments can be passed to ovs-vswitchd
via the ``other_config`` column of the ``Open_vSwitch`` table. At a minimum,
the ``dpdk-init`` option must be set to ``true``. For example::

    $ export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
    $ ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
    $ ovs-vswitchd unix:$DB_SOCK --pidfile --detach

There are many other configuration options, the most important of which are
listed below; a brief example of setting some of them follows the list.
Defaults will be provided for all values not explicitly set.

``dpdk-init``
  Specifies whether OVS should initialize and support DPDK ports. This is a
  boolean, and defaults to false.

``dpdk-lcore-mask``
  Specifies the CPU cores on which dpdk lcore threads should be spawned and
  expects a hex string (e.g. '0x123').

``dpdk-socket-mem``
  Comma-separated list of memory to pre-allocate from hugepages on specific
  sockets.

``dpdk-hugepage-dir``
  Directory where hugetlbfs is mounted.

``vhost-sock-dir``
  Option to set the path to the vhost-user unix socket files.

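The sketch below shows how a couple of these options might be set before
ovs-vswitchd is started. The values used (lcore mask ``0x2`` and the
``/dev/hugepages`` mount point) are only examples and should be adapted to
your system::

    # Illustrative values; adjust the core mask and path to your environment
    $ ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x2
    $ ovs-vsctl --no-wait set Open_vSwitch . \
        other_config:dpdk-hugepage-dir=/dev/hugepages
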
If allocating more than one GB hugepage (as for IVSHMEM), you can configure
the amount of memory used from any given NUMA node. For example, to use 1 GB
from NUMA node 0, run::

    $ ovs-vsctl --no-wait set Open_vSwitch . \
        other_config:dpdk-socket-mem="1024,0"

Similarly, if you wish to better scale the workloads across cores, multiple
pmd threads can be created and pinned to CPU cores by explicitly specifying
``pmd-cpu-mask``. Cores are numbered from 0, so to spawn two pmd threads and
pin them to cores 1 and 2, run::

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x6

For details on using IVSHMEM with DPDK, refer to :doc:`/topics/dpdk/ivshmem`.

Refer to ovs-vswitchd.conf.db(5) for additional information on configuration
options.

.. note::
   Changing any of these options requires restarting the ovs-vswitchd
   application.

Validating
----------

At this point you can use ovs-vsctl to set up bridges and other Open vSwitch
features. Seeing as we've configured the DPDK datapath, we will use DPDK-type
ports. For example, to create a userspace bridge named ``br0`` and add two
``dpdk`` ports to it, run::

    $ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
    $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
    $ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk

Refer to ovs-vsctl(8) and :doc:`/howto/dpdk` for more details.

Performance Tuning
------------------

To achieve optimal OVS performance, the system can be configured in a number
of ways. This includes BIOS tweaks, GRUB command line additions, a good
understanding of NUMA nodes, and careful selection of PCIe slots for NIC
placement.

.. note::

   This section is optional. Once installed as described above, OVS with DPDK
   will work out of the box.

Recommended BIOS Settings
~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Recommended BIOS Settings
   :header-rows: 1

   * - Setting
     - Value
   * - C3 Power State
     - Disabled
   * - C6 Power State
     - Disabled
   * - MLC Streamer
     - Enabled
   * - MLC Spatial Prefetcher
     - Enabled
   * - DCU Data Prefetcher
     - Enabled
   * - DCA
     - Enabled
   * - CPU Power and Performance
     - Performance
   * - Memory RAS and Performance Config -> NUMA optimized
     - Enabled

PCIe Slot Selection
~~~~~~~~~~~~~~~~~~~

The fastpath performance can be affected by factors related to the placement
of the NIC, such as the channel speed between the PCIe slot and the CPU or the
proximity of the PCIe slot to the CPU cores running the DPDK application.
Listed below are the steps to identify the right PCIe slot.

#. Retrieve host details using ``dmidecode``. For example::

       $ dmidecode -t baseboard | grep "Product Name"

#. Download the technical specification for the product listed, e.g. S2600WT2

#. Check the Product Architecture Overview for the riser slot placement, CPU
   sharing information and PCIe channel speeds

   For example, on the S2600WT, CPU1 and CPU2 share Riser Slot 1, with the
   channel speed between CPU1 and Riser Slot 1 at 32 GB/s and between CPU2 and
   Riser Slot 1 at 16 GB/s. Running the DPDK application on CPU1 cores with
   the NIC inserted into the Riser card slots will optimize OVS performance in
   this case.

#. Check the Riser Card #1 - Root Port mapping information for the available
   slots and individual bus speeds. On the S2600WT, slots 1 and 2 have high
   bus speeds and are potential slots for NIC placement.

Advanced Hugepage Setup
~~~~~~~~~~~~~~~~~~~~~~~

Allocate and mount 1 GB hugepages.

- For persistent allocation of huge pages, add the following options to the
  kernel bootline::

      default_hugepagesz=1G hugepagesz=1G hugepages=N

  For platforms supporting multiple huge page sizes, add multiple options::

      default_hugepagesz=<size> hugepagesz=<size> hugepages=N

  where:

  ``N``
    number of huge pages requested
  ``size``
    huge page size with an optional suffix ``[kKmMgG]``

- For run-time allocation of huge pages::

      $ echo N > /sys/devices/system/node/nodeX/hugepages/hugepages-1048576kB/nr_hugepages

  where:

  ``N``
    number of huge pages requested
  ``X``
    NUMA node

.. note::
   For run-time allocation of 1 GB huge pages, the Contiguous Memory Allocator
   (``CONFIG_CMA``) has to be supported by the kernel; check your Linux
   distribution.

Now mount the huge pages, if not already done so::

    $ mount -t hugetlbfs -o pagesize=1G none /dev/hugepages

Enable HyperThreading
~~~~~~~~~~~~~~~~~~~~~

With HyperThreading, or SMT, enabled, a physical core appears as two logical
cores. SMT can be utilized to spawn worker threads on logical cores of the
same physical core, thereby saving additional cores.

With DPDK, when pinning pmd threads to logical cores, care must be taken to
set the correct bits of the ``pmd-cpu-mask`` to ensure that the pmd threads
are pinned to SMT siblings.

Take a sample system configuration with 2 sockets, 2 * 10-core processors and
HT enabled. This gives us a total of 40 logical cores. To identify the
physical core shared by two logical cores, run::

    $ cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list

where ``N`` is the logical core number.

In this example, it would show that cores ``1`` and ``21`` share the same
physical core. As cores are counted from 0, the ``pmd-cpu-mask`` used to run
two pmd threads on these two logical cores (one physical core) is::

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x200002

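The mask value is derived by setting one bit per logical core. As a quick
sanity check (a sketch, not part of the original instructions), the shell can
compute the mask for cores 1 and 21::

    # Set bit 1 and bit 21: (1 << 1) | (1 << 21) = 0x200002
    $ printf '0x%x\n' $(( (1 << 1) | (1 << 21) ))
    0x200002
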
Isolate Cores
~~~~~~~~~~~~~

The ``isolcpus`` option can be used to isolate cores from the Linux scheduler.
The isolated cores can then be dedicated to running HPC applications or
threads. This improves application performance due to zero context switching
and minimal cache thrashing. To run platform logic on core 0 and isolate cores
1 through 19 from the scheduler, add ``isolcpus=1-19`` to the GRUB command
line, as shown in the sketch below.

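One way of doing this on a GRUB2-based distribution is sketched below; the
exact file location and the command used to regenerate the GRUB configuration
vary between distributions, so treat this as illustrative only::

    # /etc/default/grub: append isolcpus to the existing kernel command line
    GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=1-19"

    # Regenerate the GRUB configuration and reboot, e.g. on Debian/Ubuntu:
    $ update-grub
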
.. note::
   It has been verified that, in some circumstances, core isolation has
   minimal advantage due to the maturity of the Linux scheduler.

NUMA/Cluster-on-Die
~~~~~~~~~~~~~~~~~~~

Ideally inter-NUMA datapaths should be avoided where possible as packets will
go across QPI and there may be a slight performance penalty when compared with
intra-NUMA datapaths. On Intel Xeon Processor E5 v3, Cluster On Die is
introduced on models that have 10 cores or more. This makes it possible to
logically split a socket into two NUMA regions, and again it is preferred
where possible to keep critical datapaths within one cluster.

It is good practice to ensure that threads that are in the datapath are pinned
to cores in the same NUMA area, e.g. pmd threads and QEMU vCPUs responsible
for forwarding. If DPDK is built with ``CONFIG_RTE_LIBRTE_VHOST_NUMA=y``,
vHost User ports automatically detect the NUMA socket of the QEMU vCPUs and
will be serviced by a PMD from the same node provided a core on this node is
enabled in the ``pmd-cpu-mask``. ``libnuma`` packages are required for this
feature.

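To work out which NUMA node a given NIC and set of cores belong to, the
standard Linux sysfs and ``lscpu`` interfaces can be used. This is only a
sketch; ``eth1`` is an example interface name::

    # NUMA node of the PCI device behind the interface (-1 means unknown)
    $ cat /sys/class/net/eth1/device/numa_node

    # Which CPU cores belong to which NUMA node
    $ lscpu | grep "NUMA node"
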
Compiler Optimizations
~~~~~~~~~~~~~~~~~~~~~~

The default compiler optimization level is ``-O2``. Changing this to a more
aggressive compiler optimization such as ``-O3 -march=native`` with gcc
(verified on 5.3.1) can produce performance gains, though not significant.
``-march=native`` will produce code optimized for the local machine and should
only be used when the software is compiled on the testbed it will run on.

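As a sketch of how this might be applied, relying only on the standard
autoconf behaviour of honouring ``CFLAGS`` (this exact invocation is not taken
from the OVS documentation)::

    $ ./configure --with-dpdk=$DPDK_BUILD CFLAGS="-O3 -march=native"
    $ make
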
Affinity
~~~~~~~~

For superior performance, DPDK pmd threads and QEMU vCPU threads need to be
affinitized accordingly.

- PMD thread Affinity

  A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
  assigned to it. A pmd thread polls the ports for incoming packets, switches
  the packets and sends them to the tx port. A pmd thread is CPU-bound, and
  needs to be affinitized to isolated cores for optimum performance.

  By setting a bit in the mask, a pmd thread is created and pinned to the
  corresponding CPU core. e.g. to run a pmd thread on core 2::

      $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4

  .. note::
     A pmd thread on a NUMA node is only created if there is at least one DPDK
     interface from that NUMA node added to OVS.

- QEMU vCPU thread Affinity

  A VM performing simple packet forwarding or running complex packet pipelines
  has to ensure that the vCPU threads performing the work have as much CPU
  occupancy as possible.

  For example, on a multicore VM, multiple QEMU vCPU threads will be spawned.
  When the DPDK ``testpmd`` application that does packet forwarding is
  invoked, the ``taskset`` command should be used to affinitize the vCPU
  threads to the dedicated isolated cores on the host system, as in the sketch
  after this list.

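A minimal sketch of pinning a single QEMU vCPU thread on the host is shown
below. The thread ID and core number are purely illustrative; the vCPU thread
IDs can be obtained, for example, from the QEMU monitor or from
``/proc/<qemu_pid>/task``::

    # Pin the vCPU thread with TID 12345 to isolated host core 4
    $ taskset -pc 4 12345
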
Multiple Poll-Mode Driver Threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With pmd multi-threading support, OVS creates one pmd thread for each NUMA
node by default. However, in cases where there are multiple ports/rxqs
producing traffic, performance can be improved by creating multiple pmd
threads running on separate cores. These pmd threads can share the workload by
each being responsible for different ports/rxqs. Assignment of ports/rxqs to
pmd threads is done automatically.

A set bit in the mask means a pmd thread is created and pinned to the
corresponding CPU core. For example, to run pmd threads on cores 1 and 2::

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x6

When using dpdk and dpdkvhostuser ports in a bi-directional VM loopback as
shown below, spreading the workload over 2 or 4 pmd threads shows significant
improvements as there will be more total CPU occupancy available::

    NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1

DPDK Physical Port Rx Queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    $ ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>

The above command sets the number of rx queues for the DPDK physical
interface. The rx queues are assigned to pmd threads on the same NUMA node in
a round-robin fashion.

DPDK Physical Port Queue Sizes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    $ ovs-vsctl set Interface dpdk0 options:n_rxq_desc=<integer>
    $ ovs-vsctl set Interface dpdk0 options:n_txq_desc=<integer>

The above commands set the number of rx/tx descriptors that the NIC associated
with dpdk0 will be initialised with.

Different ``n_rxq_desc`` and ``n_txq_desc`` configurations yield different
benefits in terms of throughput and latency for different scenarios.
Generally, smaller queue sizes can have a positive impact on latency at the
expense of throughput. The opposite is often true for larger queue sizes.
Note: increasing the number of rx descriptors, e.g. to 4096, may have a
negative impact on performance because non-vectorised DPDK rx functions may be
used. This is dependent on the driver in use, but is true for the commonly
used i40e and ixgbe DPDK drivers.

Exact Match Cache
~~~~~~~~~~~~~~~~~

Each pmd thread contains one Exact Match Cache (EMC). After initial flow setup
in the datapath, the EMC contains a single table and provides the lowest level
(fastest) switching for DPDK ports. If there is a miss in the EMC then the
next level where switching will occur is the datapath classifier. Missing in
the EMC and looking up in the datapath classifier incurs a significant
performance penalty. If lookup misses occur in the EMC because it is too small
to handle the number of flows, its size can be increased. The EMC size can be
modified by editing the define ``EM_FLOW_HASH_SHIFT`` in ``lib/dpif-netdev.c``.

As mentioned above, an EMC is per pmd thread. An alternative way of increasing
the aggregate amount of possible flow entries in EMC and avoiding datapath
classifier lookups is to have multiple pmd threads running.

Rx Mergeable Buffers
~~~~~~~~~~~~~~~~~~~~

Rx mergeable buffers is a virtio feature that allows chaining of multiple
virtio descriptors to handle large packet sizes. Large packets are handled by
reserving and chaining multiple free descriptors together. Mergeable buffer
support is negotiated between the virtio driver and virtio device and is
supported by the DPDK vhost library. This behavior is supported and enabled by
default. However, in the case where the user knows that rx mergeable buffers
are not needed, i.e. jumbo frames are not needed, it can be forced off by
adding ``mrg_rxbuf=off`` to the QEMU command line options (see the sketch
below). By not reserving multiple chains of descriptors, more individual
virtio descriptors are available for rx to the guest using dpdkvhost ports,
which can improve performance.

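An illustrative QEMU command line fragment is shown below; the netdev name,
socket path and other device properties are placeholders, and the full set of
vhost-user options is beyond the scope of this document::

    -chardev socket,id=char0,path=/usr/local/var/run/openvswitch/vhost-user0 \
    -netdev type=vhost-user,id=net0,chardev=char0,vhostforce \
    -device virtio-net-pci,netdev=net0,mrg_rxbuf=off
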
Limitations
-----------

- Currently, DPDK ports do not use HW offload functionality.
- Network Interface Firmware requirements: Each release of DPDK is validated
  against a specific firmware version for a supported Network Interface. New
  firmware versions introduce bug fixes, performance improvements and new
  functionality that DPDK leverages. The validated firmware versions are
  available as part of the release notes for DPDK. It is recommended that
  users update Network Interface firmware to match what has been validated
  for the DPDK release.

  The latest list of validated firmware versions can be found in the `DPDK
  release notes`_.

.. _DPDK release notes: http://dpdk.org/doc/guides/rel_notes/release_16_11.html

Reporting Bugs
--------------

Report problems to bugs@openvswitch.org.