..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      =======  Heading 0 (reserved for the title in a document)
      -------  Heading 1
      ~~~~~~~  Heading 2
      +++++++  Heading 3
      '''''''  Heading 4

      Avoid deeper levels because they do not render well.

======================
Open vSwitch with DPDK
======================

This document describes how to build and install Open vSwitch using a DPDK
datapath. Open vSwitch can use the DPDK library to operate entirely in
userspace.

.. seealso::

   The :doc:`releases FAQ </faq/releases>` lists support for the required
   versions of DPDK for each version of Open vSwitch.

Build requirements
------------------

In addition to the requirements described in :doc:`general`, building Open
vSwitch with DPDK will require the following:

- DPDK 17.11

- A `DPDK supported NIC`_

  Only required when physical ports are in use

- A suitable kernel

  On Linux distributions running kernel version >= 3.0, only `IOMMU` needs to
  be enabled via the GRUB command line, assuming you are using **VFIO**. For
  older kernels, ensure the kernel is built with ``UIO``, ``HUGETLBFS``,
  ``PROC_PAGE_MONITOR``, ``HPET``, and ``HPET_MMAP`` support. If these are not
  present, it will be necessary to upgrade your kernel or build a custom
  kernel with these flags enabled. A sketch of the relevant GRUB configuration
  is shown below.
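
  Since this document later checks for ``iommu=pt`` and ``intel_iommu=on`` on
  the kernel command line, the sketch assumes those parameters (the file path
  and regeneration command vary by distribution)::

      # Append to the kernel command line in /etc/default/grub:
      GRUB_CMDLINE_LINUX="... iommu=pt intel_iommu=on"

      # Regenerate the GRUB configuration and reboot:
      $ grub2-mkconfig -o /boot/grub2/grub.cfg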

Detailed system requirements can be found at `DPDK requirements`_.

.. _DPDK supported NIC: http://dpdk.org/doc/nics
.. _DPDK requirements: http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html

Installing
----------

Install DPDK
~~~~~~~~~~~~

#. Download the `DPDK sources`_, extract the file and set ``DPDK_DIR``::

       $ cd /usr/src/
       $ wget http://fast.dpdk.org/rel/dpdk-17.11.tar.xz
       $ tar xf dpdk-17.11.tar.xz
       $ export DPDK_DIR=/usr/src/dpdk-17.11
       $ cd $DPDK_DIR

#. (Optional) Configure DPDK as a shared library

   DPDK can be built as either a static library or a shared library. By
   default, it is configured for the former. If you wish to use the latter,
   set ``CONFIG_RTE_BUILD_SHARED_LIB=y`` in ``$DPDK_DIR/config/common_base``.
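
   For example, assuming the stock ``common_base`` ships with
   ``CONFIG_RTE_BUILD_SHARED_LIB=n``, the option can be flipped with::

       $ sed -i 's/CONFIG_RTE_BUILD_SHARED_LIB=n/CONFIG_RTE_BUILD_SHARED_LIB=y/' \
           $DPDK_DIR/config/common_base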

   .. note::

      Minor performance loss is expected when using OVS with a shared DPDK
      library compared to a static DPDK library.

#. Configure and install DPDK

   Build and install the DPDK library::

       $ export DPDK_TARGET=x86_64-native-linuxapp-gcc
       $ export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
       $ make install T=$DPDK_TARGET DESTDIR=install

#. (Optional) Export the DPDK shared library location

   If DPDK was built as a shared library, export the path to this library for
   use when building OVS::

       $ export LD_LIBRARY_PATH=$DPDK_DIR/x86_64-native-linuxapp-gcc/lib

.. _DPDK sources: http://dpdk.org/rel

Install OVS
~~~~~~~~~~~

OVS can be installed using different methods. For OVS to use the DPDK
datapath, it has to be configured with DPDK support (``--with-dpdk``).

.. note::
   This section focuses on a generic recipe that suits most cases. For
   distribution-specific instructions, refer to one of the more relevant
   guides.

.. _OVS sources: http://openvswitch.org/releases/

#. Ensure the standard OVS requirements, described in
   :ref:`general-build-reqs`, are installed

#. Bootstrap, if required, as described in :ref:`general-bootstrapping`

#. Configure the package using the ``--with-dpdk`` flag::

       $ ./configure --with-dpdk=$DPDK_BUILD

   where ``DPDK_BUILD`` is the path to the built DPDK library. This can be
   skipped if the DPDK library is installed in its default location.

   If no path is provided to ``--with-dpdk``, but a pkg-config configuration
   for libdpdk is available, the include paths will be generated via an
   equivalent ``pkg-config --cflags libdpdk``.

   .. note::
      While ``--with-dpdk`` is required, you can pass any other configuration
      option described in :ref:`general-configuring`.

#. Build and install OVS, as described in :ref:`general-building`

Additional information can be found in :doc:`general`.

.. note::
   If you are using the Fedora or Red Hat package, the Open vSwitch daemon
   will run as a non-root user. This implies that you must have a working
   IOMMU. Visit the `RHEL README`__ for additional information.

   __ https://github.com/openvswitch/ovs/blob/master/rhel/README.RHEL.rst

Setup
-----

Setup Hugepages
~~~~~~~~~~~~~~~

Allocate a number of 2M Huge pages:

- For persistent allocation of huge pages, write to the ``hugepages.conf``
  file in ``/etc/sysctl.d``::

      $ echo 'vm.nr_hugepages=2048' > /etc/sysctl.d/hugepages.conf

- For run-time allocation of huge pages, use the ``sysctl`` utility::

      $ sysctl -w vm.nr_hugepages=N  # where N = No. of 2M huge pages

To verify hugepage configuration::

    $ grep HugePages_ /proc/meminfo

Mount the hugepages, if not already mounted by default::

    $ mount -t hugetlbfs none /dev/hugepages

.. _dpdk-vfio:

Setup DPDK devices using VFIO
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

VFIO is preferred to the UIO driver when using recent versions of DPDK. VFIO
requires support from both the kernel and BIOS. For the former, kernel
version > 3.6 must be used. For the latter, you must enable VT-d in the BIOS
and ensure this is configured via grub. To ensure VT-d is enabled via the
BIOS, run::

    $ dmesg | grep -e DMAR -e IOMMU

If VT-d is not enabled in the BIOS, enable it now.

To ensure VT-d is enabled in the kernel, run::

    $ cat /proc/cmdline | grep iommu=pt
    $ cat /proc/cmdline | grep intel_iommu=on

If VT-d is not enabled in the kernel, enable it now.

Once VT-d is correctly configured, load the required modules and bind the NIC
to the VFIO driver::

    $ modprobe vfio-pci
    $ /usr/bin/chmod a+x /dev/vfio
    $ /usr/bin/chmod 0666 /dev/vfio/*
    $ $DPDK_DIR/usertools/dpdk-devbind.py --bind=vfio-pci eth1
    $ $DPDK_DIR/usertools/dpdk-devbind.py --status

Setup OVS
~~~~~~~~~

Open vSwitch should be started as described in :doc:`general` with the
exception of ovs-vswitchd, which requires some special configuration to enable
DPDK functionality. DPDK configuration arguments can be passed to ovs-vswitchd
via the ``other_config`` column of the ``Open_vSwitch`` table. At a minimum,
the ``dpdk-init`` option must be set to ``true``. For example::

    $ export PATH=$PATH:/usr/local/share/openvswitch/scripts
    $ export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
    $ ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
    $ ovs-ctl --no-ovsdb-server --db-sock="$DB_SOCK" start

There are many other configuration options, the most important of which are
listed below. Defaults will be provided for all values not explicitly set.

``dpdk-init``
  Specifies whether OVS should initialize and support DPDK ports. This is a
  boolean, and defaults to false.

``dpdk-lcore-mask``
  Specifies the CPU cores on which dpdk lcore threads should be spawned and
  expects a hex string (e.g. ``0x123``).

``dpdk-socket-mem``
  Comma separated list of memory to pre-allocate from hugepages on specific
  sockets.

``dpdk-hugepage-dir``
  Directory where hugetlbfs is mounted.

``vhost-sock-dir``
  Option to set the path to the vhost-user unix socket files.
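
For example, to restrict dpdk lcore threads to cores 0 and 1 (an illustrative
mask; pick cores appropriate to your system), run::

    $ ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x3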

If allocating more than one GB hugepage, you can configure the amount of
memory used from any given NUMA node. For example, to use 1GB from NUMA node 0
and 0GB for all other NUMA nodes, run::

    $ ovs-vsctl --no-wait set Open_vSwitch . \
        other_config:dpdk-socket-mem="1024,0"

or::

    $ ovs-vsctl --no-wait set Open_vSwitch . \
        other_config:dpdk-socket-mem="1024"

.. note::
   Changing any of these options requires restarting the ovs-vswitchd
   application.

See the section ``Performance Tuning`` for important DPDK customizations.

Validating
----------

At this point you can use ovs-vsctl to set up bridges and other Open vSwitch
features. Seeing as we've configured the DPDK datapath, we will use DPDK-type
ports. For example, to create a userspace bridge named ``br0`` and add two
``dpdk`` ports to it, run::

    $ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
    $ ovs-vsctl add-port br0 myportnameone -- set Interface myportnameone \
        type=dpdk options:dpdk-devargs=0000:06:00.0
    $ ovs-vsctl add-port br0 myportnametwo -- set Interface myportnametwo \
        type=dpdk options:dpdk-devargs=0000:06:00.1

DPDK devices will not be available for use until a valid dpdk-devargs is
specified.

Refer to ovs-vsctl(8) and :doc:`/howto/dpdk` for more details.

Performance Tuning
------------------

To achieve optimal OVS performance, the system can be configured in a number
of ways. This includes BIOS tweaks, GRUB command-line additions, a better
understanding of NUMA nodes, and careful selection of PCIe slots for NIC
placement.

.. note::

   This section is optional. Once installed as described above, OVS with DPDK
   will work out of the box.

Recommended BIOS Settings
~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Recommended BIOS Settings
   :header-rows: 1

   * - Setting
     - Value
   * - C3 Power State
     - Disabled
   * - C6 Power State
     - Disabled
   * - MLC Streamer
     - Enabled
   * - MLC Spatial Prefetcher
     - Enabled
   * - DCU Data Prefetcher
     - Enabled
   * - DCA
     - Enabled
   * - CPU Power and Performance
     - Performance
   * - Memory RAS and Performance Config -> NUMA optimized
     - Enabled

PCIe Slot Selection
~~~~~~~~~~~~~~~~~~~

The fastpath performance can be affected by factors related to the placement
of the NIC, such as channel speeds between PCIe slot and CPU or the proximity
of the PCIe slot to the CPU cores running the DPDK application. Listed below
are the steps to identify the right PCIe slot.

#. Retrieve host details using ``dmidecode``. For example::

       $ dmidecode -t baseboard | grep "Product Name"

#. Download the technical specification for the product listed, e.g. S2600WT2

#. Check the Product Architecture Overview for the riser slot placement, CPU
   sharing info and also PCIe channel speeds

   For example: on the S2600WT, CPU1 and CPU2 share Riser Slot 1, with the
   channel speed between CPU1 and Riser Slot 1 at 32GB/s, and between CPU2 and
   Riser Slot 1 at 16GB/s. Running the DPDK application on CPU1 cores with the
   NIC inserted into the riser card slots will optimize OVS performance in
   this case.

#. Check the Riser Card #1 - Root Port mapping information for the available
   slots and individual bus speeds. On the S2600WT, slots 1 and 2 have high
   bus speeds and are potential slots for NIC placement.

Advanced Hugepage Setup
~~~~~~~~~~~~~~~~~~~~~~~

Allocate and mount 1 GB hugepages.

- For persistent allocation of huge pages, add the following options to the
  kernel bootline::

      default_hugepagesz=1GB hugepagesz=1G hugepages=N

  For platforms supporting multiple huge page sizes, add multiple options::

      default_hugepagesz=<size> hugepagesz=<size> hugepages=N

  where:

  ``N``
    number of huge pages requested
  ``size``
    huge page size with an optional suffix ``[kKmMgG]``

- For run-time allocation of huge pages::

      $ echo N > /sys/devices/system/node/nodeX/hugepages/hugepages-1048576kB/nr_hugepages

  where:

  ``N``
    number of huge pages requested
  ``X``
    NUMA Node

  .. note::
     For run-time allocation of 1G huge pages, Contiguous Memory Allocator
     (``CONFIG_CMA``) has to be supported by the kernel; check your Linux
     distro.

Now mount the huge pages, if not already mounted::

    $ mount -t hugetlbfs -o pagesize=1G none /dev/hugepages

Isolate Cores
~~~~~~~~~~~~~

The ``isolcpus`` option can be used to isolate cores from the Linux scheduler.
The isolated cores can then be dedicated to running HPC applications or
threads. This helps improve application performance due to zero context
switching and minimal cache thrashing. To run platform logic on core 0 and
isolate cores 1 through 19 from the scheduler, add ``isolcpus=1-19`` to the
GRUB command line, as sketched below.
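
The option is appended to the kernel command line in the same way as the IOMMU
parameters shown earlier (the file path and regeneration command vary by
distribution)::

    GRUB_CMDLINE_LINUX="... isolcpus=1-19"
    $ grub2-mkconfig -o /boot/grub2/grub.cfg   # then reboot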

.. note::
   In some circumstances core isolation has been found to offer minimal
   advantage, due to the maturity of the Linux scheduler.

Compiler Optimizations
~~~~~~~~~~~~~~~~~~~~~~

The default compiler optimization level is ``-O2``. Changing this to a more
aggressive compiler optimization such as ``-O3 -march=native`` with gcc
(verified on 5.3.1) can produce performance gains, though not significant
ones. ``-march=native`` produces code optimized for the local machine and
should only be used when the software is compiled on the testbed where it
will run.
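
As a sketch, such flags can be passed at configure time via the standard
autoconf variables (the flag values are illustrative)::

    $ ./configure --with-dpdk=$DPDK_BUILD CFLAGS="-O3 -march=native"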

Multiple Poll-Mode Driver Threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With pmd multi-threading support, OVS creates one pmd thread for each NUMA
node by default, if there is at least one DPDK interface from that NUMA node
added to OVS. However, in cases where there are multiple ports/rxq's producing
traffic, performance can be improved by creating multiple pmd threads running
on separate cores. These pmd threads can share the workload by each being
responsible for different ports/rxq's. Assignment of ports/rxq's to pmd
threads is done automatically.

The cores to use are selected via the ``pmd-cpu-mask`` option. A set bit in
the mask means a pmd thread is created and pinned to the corresponding CPU
core. For example, to run pmd threads on cores 1 and 2::

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x6

When using dpdk and dpdkvhostuser ports in a bi-directional VM loopback as
shown below, spreading the workload over 2 or 4 pmd threads shows significant
improvements, as there will be more total CPU occupancy available::

    NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1

Refer to ovs-vswitchd.conf.db(5) for additional information on configuration
options.

Affinity
~~~~~~~~

For superior performance, DPDK pmd threads and QEMU vCPU threads need to be
affinitized accordingly.

- PMD thread Affinity

  A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
  assigned to it. A pmd thread polls the ports for incoming packets, switches
  the packets and sends them to the tx port. A pmd thread is CPU bound, and
  needs to be affinitized to isolated cores for optimum performance. Even
  though a PMD thread may exist, the thread only starts consuming CPU cycles
  if there is at least one receive queue assigned to the pmd.

  .. note::
     On NUMA systems, PCI devices are also local to a NUMA node. Unbound rx
     queues for a PCI device will be assigned to a pmd on its local NUMA node
     if a non-isolated PMD exists on that NUMA node. If not, the queue will be
     assigned to a non-isolated pmd on a remote NUMA node. This will result in
     reduced maximum throughput on that device and possibly on other devices
     assigned to that pmd thread. If such a queue assignment is made a warning
     message will be logged: "There's no available (non-isolated) pmd thread
     on numa node N. Queue Q on port P will be assigned to the pmd on core C
     (numa node N'). Expect reduced performance."

  Binding PMD threads to cores is described in the above section
  ``Multiple Poll-Mode Driver Threads``.

- QEMU vCPU thread Affinity

  A VM performing simple packet forwarding or running complex packet pipelines
  has to ensure that the vCPU threads performing the work have as much CPU
  occupancy as possible.

  For example, on a multicore VM, multiple QEMU vCPU threads will be spawned.
  When the DPDK ``testpmd`` application that does packet forwarding is
  invoked, the ``taskset`` command should be used to affinitize the vCPU
  threads to the dedicated isolated cores on the host system.
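
  As a sketch, once the PID of a vCPU thread is known, it can be pinned to an
  isolated host core (the PID and core number here are illustrative)::

      $ taskset -pc 4 12345   # pin thread 12345 to host core 4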

Enable HyperThreading
~~~~~~~~~~~~~~~~~~~~~

With HyperThreading, or SMT, enabled, a physical core appears as two logical
cores. SMT can be utilized to spawn worker threads on logical cores of the
same physical core, thereby saving additional cores.

With DPDK, when pinning pmd threads to logical cores, care must be taken to
set the correct bits of the ``pmd-cpu-mask`` to ensure that the pmd threads
are pinned to SMT siblings.

Take a sample system configuration, with 2 sockets, 2 * 10 core processors,
HT enabled. This gives us a total of 40 logical cores. To identify the
physical core shared by two logical cores, run::

    $ cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list

where ``N`` is the logical core number.

In this example, it would show that cores ``1`` and ``21`` share the same
physical core. Logical cores can be specified in pmd-cpu-masks similarly to
physical cores, as described in ``Multiple Poll-Mode Driver Threads``.
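
For instance, to pin pmd threads to the sibling pair above, set bits 1 and 21
of the mask (``0x2 | 0x200000 = 0x200002``); the core numbers are
illustrative::

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x200002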

NUMA/Cluster-on-Die
~~~~~~~~~~~~~~~~~~~

Ideally inter-NUMA datapaths should be avoided where possible as packets will
go across QPI and there may be a slight performance penalty when compared with
intra-NUMA datapaths. On Intel Xeon Processor E5 v3, Cluster On Die is
introduced on models that have 10 cores or more. This makes it possible to
logically split a socket into two NUMA regions, and again it is preferred
where possible to keep critical datapaths within one cluster.

It is good practice to ensure that threads that are in the datapath are pinned
to cores in the same NUMA area, e.g. pmd threads and QEMU vCPUs responsible
for forwarding. If DPDK is built with ``CONFIG_RTE_LIBRTE_VHOST_NUMA=y``,
vHost User ports automatically detect the NUMA socket of the QEMU vCPUs and
will be serviced by a PMD from the same node provided a core on this node is
enabled in the ``pmd-cpu-mask``. ``libnuma`` packages are required for this
feature.

Binding PMD threads is described in the above section
``Multiple Poll-Mode Driver Threads``.

DPDK Physical Port Rx Queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    $ ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>

The above command sets the number of rx queues for a DPDK physical interface.
The rx queues are assigned to pmd threads on the same NUMA node in a
round-robin fashion.
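
For example, to request four rx queues on a port named ``dpdk0`` (the port
name and queue count are illustrative)::

    $ ovs-vsctl set Interface dpdk0 options:n_rxq=4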

.. _dpdk-queues-sizes:

DPDK Physical Port Queue Sizes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    $ ovs-vsctl set Interface dpdk0 options:n_rxq_desc=<integer>
    $ ovs-vsctl set Interface dpdk0 options:n_txq_desc=<integer>

The above commands set the number of rx/tx descriptors that the NIC associated
with dpdk0 will be initialised with.

Different ``n_rxq_desc`` and ``n_txq_desc`` configurations yield different
benefits in terms of throughput and latency for different scenarios.
Generally, smaller queue sizes can have a positive impact on latency at the
expense of throughput. The opposite is often true for larger queue sizes.
Note: increasing the number of rx descriptors, e.g. to 4096, may have a
negative impact on performance due to the fact that non-vectorised DPDK rx
functions may be used. This is dependent on the driver in use, but is true for
the commonly used i40e and ixgbe DPDK drivers.

Exact Match Cache
~~~~~~~~~~~~~~~~~

Each pmd thread contains one Exact Match Cache (EMC). After initial flow setup
in the datapath, the EMC contains a single table and provides the lowest level
(fastest) switching for DPDK ports. If there is a miss in the EMC then the
next level where switching will occur is the datapath classifier. Missing in
the EMC and looking up in the datapath classifier incurs a significant
performance penalty. If lookup misses occur in the EMC because it is too small
to handle the number of flows, its size can be increased. The EMC size can be
modified by editing the define ``EM_FLOW_HASH_SHIFT`` in
``lib/dpif-netdev.c``.
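
As an illustration, the define looks similar to the following, where the table
holds ``2^EM_FLOW_HASH_SHIFT`` entries (the value shown here is illustrative
and may differ between versions)::

    /* lib/dpif-netdev.c */
    #define EM_FLOW_HASH_SHIFT 13
    #define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT)  /* 8192 entries */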

As mentioned above, an EMC is per pmd thread. An alternative way of increasing
the aggregate amount of possible flow entries in EMC and avoiding datapath
classifier lookups is to have multiple pmd threads running.

Rx Mergeable Buffers
~~~~~~~~~~~~~~~~~~~~

Rx mergeable buffers is a virtio feature that allows chaining of multiple
virtio descriptors to handle large packet sizes. Large packets are handled by
reserving and chaining multiple free descriptors together. Mergeable buffer
support is negotiated between the virtio driver and virtio device and is
supported by the DPDK vhost library. This behavior is supported and enabled by
default; however, in the case where the user knows that rx mergeable buffers
are not needed, i.e. jumbo frames are not needed, it can be forced off by
adding ``mrg_rxbuf=off`` to the QEMU command line options. By not reserving
multiple chains of descriptors, more individual virtio descriptors are
available for rx to the guest using dpdkvhost ports, and this can improve
performance.
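
A sketch of the relevant QEMU option, attached to an illustrative virtio-net
device definition::

    -device virtio-net-pci,netdev=net0,mrg_rxbuf=off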

Output Packet Batching
~~~~~~~~~~~~~~~~~~~~~~

To take advantage of batched transmit functions, OVS collects packets in
intermediate queues before sending when processing a batch of received
packets. Even if packets are matched by different flows, OVS uses a single
send operation for all packets destined to the same output port.

Furthermore, OVS is able to buffer packets in these intermediate queues for a
configurable amount of time to reduce the frequency of send bursts at medium
load levels when the packet receive rate is high, but the receive batch size
is still very small. This is particularly beneficial for packets transmitted
to VMs using an interrupt-driven virtio driver, where the interrupt overhead
is significant for the OVS PMD, the host operating system and the guest
driver.

The ``tx-flush-interval`` parameter can be used to specify the time in
microseconds OVS should wait between two send bursts to a given port (default
is ``0``). When the intermediate queue fills up before that time is over, the
buffered packet batch is sent immediately::

    $ ovs-vsctl set Open_vSwitch . other_config:tx-flush-interval=50

This parameter influences both throughput and latency, depending on the
traffic load on the port. In general, lower values decrease latency while
higher values may be useful to achieve higher throughput.

Low traffic (``packet rate < 1 / tx-flush-interval``) should not experience
any significant latency or throughput increase as packets are forwarded
immediately.

At intermediate load levels
(``1 / tx-flush-interval < packet rate < 32 / tx-flush-interval``) traffic
should experience an average latency increase of up to
``1 / 2 * tx-flush-interval`` and a possible throughput improvement.

Very high traffic (``packet rate >> 32 / tx-flush-interval``) should
experience an average latency increase equal to ``32 / (2 * packet rate)``.
Most send batches in this case will contain the maximum number of packets
(``32``).

A ``tx-flush-interval`` value of ``50`` microseconds has been shown to provide
a good performance increase in a ``PHY-VM-PHY`` scenario on ``x86`` systems
for interrupt-driven guests while keeping the latency increase at a reasonable
level:

https://mail.openvswitch.org/pipermail/ovs-dev/2017-December/341628.html

.. note::
   The throughput impact of this option depends significantly on the scenario
   and the traffic patterns. For example: a ``tx-flush-interval`` value of
   ``50`` microseconds shows performance degradation in a ``PHY-VM-PHY`` with
   bonded PHY scenario while testing with ``256 - 1024`` packet flows:

   https://mail.openvswitch.org/pipermail/ovs-dev/2017-December/341700.html

The average number of packets per output batch can be checked in PMD stats::

    $ ovs-appctl dpif-netdev/pmd-stats-show

Limitations
-----------

- Currently DPDK ports do not use HW offload functionality.
- Network Interface Firmware requirements: Each release of DPDK is validated
  against a specific firmware version for a supported Network Interface. New
  firmware versions introduce bug fixes, performance improvements and new
  functionality that DPDK leverages. The validated firmware versions are
  available as part of the release notes for DPDK. It is recommended that
  users update Network Interface firmware to match what has been validated
  for the DPDK release.

  The latest list of validated firmware versions can be found in the `DPDK
  release notes`_.

.. _DPDK release notes: http://dpdk.org/doc/guides/rel_notes/release_17_11.html

- Upper bound MTU: DPDK device drivers differ in how the L2 frame for a given
  MTU value is calculated, e.g. the i40e driver includes 2 x vlan headers in
  the MTU overhead, the em driver includes 1 x vlan header, and the ixgbe
  driver does not include a vlan header in the overhead. Currently it is not
  possible for OVS DPDK to know what upper bound MTU value is supported for a
  given device. As such, OVS DPDK must provision for the case where the L2
  frame for a given MTU includes 2 x vlan headers. This reduces the upper
  bound MTU value for devices that do not include vlan headers in their L2
  frames by 8 bytes, e.g. the upper bound MTU for ixgbe devices is reduced
  from 9710 to 9702. This workaround is temporary and is expected to be
  removed once a method is provided by DPDK to query the upper bound MTU
  value for a given device.

Reporting Bugs
--------------

Report problems to bugs@openvswitch.org.