..  SPDX-License-Identifier: BSD-3-Clause
    Copyright 2012 6WIND S.A.
    Copyright 2015 Mellanox Technologies, Ltd

MLX4 poll mode driver library
=============================

The MLX4 poll mode driver library (**librte_pmd_mlx4**) implements support
for **Mellanox ConnectX-3** and **Mellanox ConnectX-3 Pro** 10/40 Gbps adapters
as well as their virtual functions (VF) in SR-IOV context.

Information and documentation about this family of adapters can be found on
the `Mellanox website <http://www.mellanox.com>`_. Help is also provided by
the `Mellanox community <http://community.mellanox.com/welcome>`_.

There is also a `section dedicated to this poll mode driver
<http://www.mellanox.com/page/products_dyn?product_family=209&mtag=pmd_for_dpdk>`_.

.. note::

   Due to external dependencies, this driver is disabled by default. It must
   be enabled manually by setting ``CONFIG_RTE_LIBRTE_MLX4_PMD=y`` and
   recompiling DPDK.

Implementation details
----------------------

Most Mellanox ConnectX-3 devices provide two ports but expose a single PCI
bus address. Unlike most drivers, librte_pmd_mlx4 therefore registers itself
as a PCI driver that allocates one Ethernet device per detected port.

For this reason, one cannot white/blacklist a single port without also
white/blacklisting the others on the same device.

Besides its dependency on libibverbs (which implies libmlx4 and associated
kernel support), librte_pmd_mlx4 relies heavily on system calls for control
operations such as querying/updating the MTU and flow control parameters.

For security reasons and robustness, this driver only deals with virtual
memory addresses. The way resource allocations are handled by the kernel,
combined with hardware specifications that allow it to handle virtual memory
addresses directly, ensures that DPDK applications cannot access random
physical memory (or memory that does not belong to the current process).

This capability allows the PMD to coexist with kernel network interfaces
which remain functional, although they stop receiving unicast packets as
long as they share the same MAC address.

The :ref:`flow_isolated_mode` is supported.

Compiling librte_pmd_mlx4 causes DPDK to be linked against libibverbs.

Configuration
-------------

Compilation options
~~~~~~~~~~~~~~~~~~~

These options can be modified in the ``.config`` file.

- ``CONFIG_RTE_LIBRTE_MLX4_PMD`` (default **n**)

  Toggle compilation of librte_pmd_mlx4 itself.

- ``CONFIG_RTE_IBVERBS_LINK_DLOPEN`` (default **n**)

  Build PMD with additional code to make it loadable without hard
  dependencies on **libibverbs** or **libmlx4**, which may not be installed
  on the target system.

  In this mode, their presence is still required for it to run properly,
  however their absence won't prevent a DPDK application from starting (with
  ``CONFIG_RTE_BUILD_SHARED_LIB`` disabled) and they won't show up as
  missing with ``ldd(1)``.

  It works by moving these dependencies to a purpose-built rdma-core "glue"
  plug-in which must either be installed in a directory whose name is based
  on ``CONFIG_RTE_EAL_PMD_PATH`` suffixed with ``-glue`` if set, or in a
  standard location for the dynamic linker (e.g. ``/lib``) if left to the
  default empty string (``""``).

  This option has no performance impact.

- ``CONFIG_RTE_IBVERBS_LINK_STATIC`` (default **n**)

  Embed static flavor of the dependencies **libibverbs** and **libmlx4**
  in the PMD shared library or the executable static binary.

- ``CONFIG_RTE_LIBRTE_MLX4_DEBUG`` (default **n**)

  Toggle debugging code and stricter compilation flags. Enabling this option
  adds additional run-time checks and debugging messages at the cost of
  lower performance.

This option is available in meson:

- ``ibverbs_link`` can be ``static``, ``shared``, or ``dlopen``.

Environment variables
~~~~~~~~~~~~~~~~~~~~~

- ``MLX4_GLUE_PATH``

  A list of directories in which to search for the rdma-core "glue" plug-in,
  separated by colons or semi-colons.

  Only matters when compiled with ``CONFIG_RTE_IBVERBS_LINK_DLOPEN``
  enabled and most useful when ``CONFIG_RTE_EAL_PMD_PATH`` is also set,
  since ``LD_LIBRARY_PATH`` has no effect in this case.
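
As an illustration only, the separator rule for ``MLX4_GLUE_PATH`` can be
sketched in Python. The helper name ``split_glue_path`` and the example
directories are hypothetical and not part of DPDK. Dropping empty entries is
an assumption of this sketch, since the text above does not say how the PMD
treats them.

```python
import re


def split_glue_path(value):
    """Split a MLX4_GLUE_PATH-style string on colons or semi-colons.

    Empty entries are dropped; this is an assumption of the sketch,
    not documented PMD behavior.
    """
    return [entry for entry in re.split(r"[:;]", value) if entry]


# Both separators may be mixed in one value (example paths are made up).
for directory in split_glue_path("/opt/dpdk/glue:/usr/lib64;/lib"):
    print(directory)
```

A helper like this can be handy for checking that every listed directory
exists before starting a DPDK application.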

Run-time configuration
~~~~~~~~~~~~~~~~~~~~~~

- librte_pmd_mlx4 brings kernel network interfaces up during initialization
  because it is affected by their state. Forcing them down prevents packet
  reception.

- **ethtool** operations on related kernel interfaces also affect the PMD.

- ``port`` parameter [int]

  This parameter provides a physical port to probe and can be specified multiple
  times for additional ports. All ports are probed by default if left
  unspecified.

- ``mr_ext_memseg_en`` parameter [int]

  A nonzero value enables extending memseg when registering DMA memory. If
  enabled, the number of entries in the MR (Memory Region) lookup table on the
  datapath is minimized, which benefits performance. On the other hand, it
  worsens memory utilization because registered memory is pinned by the kernel
  driver. Even if a page in the extended chunk is freed, it does not become
  reusable until the entire memory is freed.

  Enabled by default.

Kernel module parameters
~~~~~~~~~~~~~~~~~~~~~~~~

The **mlx4_core** kernel module has several parameters that affect the
behavior and/or the performance of librte_pmd_mlx4. Some of them are described
below.

- **num_vfs** (integer or triplet, optionally prefixed by device address
  strings)

  Create the given number of VFs on the specified devices.

- **log_num_mgm_entry_size** (integer)

  Device-managed flow steering (DMFS) is required by DPDK applications. It is
  enabled by using a negative value, the last four bits of which have a
  special meaning.

  - **-1**: force device-managed flow steering (DMFS).
  - **-7**: configure optimized steering mode to improve performance with the
    following limitation: VLAN filtering is not supported with this mode.
    This is the recommended mode when VLAN filtering is not needed.
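
The two values above can be told apart with a few lines of Python. This is a
minimal sketch and not part of DPDK or the kernel: the function names are
hypothetical, and only the values named in this guide are interpreted. The
meaning of the individual flag bits is deliberately left out because this
guide does not define it.

```python
def low_four_bits(value):
    """Return the last four bits of the two's-complement value as a string."""
    return format(value & 0xF, "04b")


def describe_steering_mode(value):
    """Map a log_num_mgm_entry_size value to the modes named in this guide."""
    if value == -1:
        return "DMFS forced"
    if value == -7:
        return "optimized steering mode (no VLAN filtering)"
    if value < 0:
        return "DMFS enabled, flag bits not covered by this guide"
    return "DMFS not enabled"


for candidate in (-1, -7, 10):
    print(candidate, low_four_bits(candidate), describe_steering_mode(candidate))
```

The value to pass in is the one read back from the ``mlx4_core`` module
parameter on a running system.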

Limitations
-----------

- For secondary process:

  - Forked secondary processes are not supported.
  - External memory unregistered in EAL memseg list cannot be used for DMA
    unless such memory has been registered by ``mlx4_mr_update_ext_mp()`` in
    the primary process and remapped to the same virtual address in the
    secondary process. If the external memory is registered by the primary
    process but has a different virtual address in the secondary process,
    unexpected errors may occur.

- CRC stripping is supported by default and always reported as "true".
  The ability to enable/disable CRC stripping requires OFED version
  4.3-1.5.0.0 and above or rdma-core version v18 and above.

- TSO (Transmit Segmentation Offload) is supported in OFED version
  4.4 and above.

Prerequisites
-------------

This driver relies on external libraries and kernel drivers for resource
allocation and initialization. The following dependencies are not part of
DPDK and must be installed separately:

- **libibverbs** (provided by rdma-core package)

  User space verbs framework used by librte_pmd_mlx4. This library provides
  a generic interface between the kernel and low-level user space drivers
  such as libmlx4.

  It allows slow and privileged operations (context initialization, hardware
  resource allocation) to be managed by the kernel and fast operations to
  never leave user space.

- **libmlx4** (provided by rdma-core package)

  Low-level user space driver library for Mellanox ConnectX-3 devices;
  it is automatically loaded by libibverbs.

  This library basically implements send/receive calls to the hardware
  queues.

- **Kernel modules**

  They provide the kernel-side verbs API and low level device drivers that
  manage actual hardware initialization and resource sharing with user
  space processes.

  Unlike most other PMDs, these modules must remain loaded and bound to
  their devices:

  - mlx4_core: hardware driver managing Mellanox ConnectX-3 devices.
  - mlx4_en: Ethernet device driver that provides kernel network interfaces.
  - mlx4_ib: InfiniBand device driver.
  - ib_uverbs: user space driver for verbs (entry point for libibverbs).

- **Firmware update**

  Mellanox OFED releases include firmware updates for ConnectX-3 adapters.

  Because each release provides new features, these updates must be applied to
  match the kernel modules and libraries they come with.

.. note::

   Both libraries are BSD and GPL licensed. Linux kernel modules are GPL
   licensed.

Depending on system constraints and user preferences, use either the RDMA core
library with a recent enough Linux kernel release (recommended) or Mellanox
OFED, which provides compatibility with older releases.

Current RDMA core package and Linux kernel (recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Minimal Linux kernel version: 4.14.
- Minimal RDMA core version: v15 (see `RDMA core installation documentation`_).

- Starting with rdma-core v21, static libraries can be built::

    cd build
    CFLAGS=-fPIC cmake -DIN_PLACE=1 -DENABLE_STATIC=1 -GNinja ..
    ninja

.. _`RDMA core installation documentation`: https://raw.githubusercontent.com/linux-rdma/rdma-core/master/README.md

If rdma-core libraries are built but not installed, the DPDK makefile can link
them thanks to these environment variables:

   - ``EXTRA_CFLAGS=-I/path/to/rdma-core/build/include``
   - ``EXTRA_LDFLAGS=-L/path/to/rdma-core/build/lib``
   - ``PKG_CONFIG_PATH=/path/to/rdma-core/build/lib/pkgconfig``

.. _Mellanox_OFED_as_a_fallback:

Mellanox OFED as a fallback
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- `Mellanox OFED`_ version: **4.4, 4.5, 4.6**.
- firmware version: **2.42.5000** and above.

.. _`Mellanox OFED`: http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers

.. note::

   Several versions of Mellanox OFED are available. Installing the version
   this DPDK release was developed and tested against is strongly
   recommended. Please check the `prerequisites`_.

Installing Mellanox OFED
^^^^^^^^^^^^^^^^^^^^^^^^

1. Download latest Mellanox OFED.

2. Install the required libraries and kernel modules either by installing
   only the required set, or by installing the entire Mellanox OFED:

   For bare metal use::

        ./mlnxofedinstall --dpdk --upstream-libs

   For SR-IOV hypervisors use::

        ./mlnxofedinstall --dpdk --upstream-libs --enable-sriov --hypervisor

   For SR-IOV virtual machine use::

        ./mlnxofedinstall --dpdk --upstream-libs --guest

3. Verify the firmware is the correct one::

        ibv_devinfo

4. Set all port links to Ethernet, following the instructions on the screen::

        connectx_port_config

5. Continue with :ref:`section 2 of the Quick Start Guide <QSG_2>`.

.. _qsg:

Quick Start Guide
-----------------

1. Set all port links to Ethernet::

        PCI=<NIC PCI address>
        echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port0"
        echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port1"

   .. note::

        If using Mellanox OFED, one can permanently set the port link
        to Ethernet using the connectx_port_config tool provided by it.
        See :ref:`Mellanox_OFED_as_a_fallback`.

.. _QSG_2:

2. In case of bare metal or hypervisor, configure optimized steering mode
   by adding the following line to ``/etc/modprobe.d/mlx4_core.conf``::

        options mlx4_core log_num_mgm_entry_size=-7

   .. note::

        If VLAN filtering is used, set log_num_mgm_entry_size=-1.
        Performance degradation can occur in this case.

3. Restart the driver::

        /etc/init.d/openibd restart

   or::

        service openibd restart

4. Compile DPDK and you are ready to go. See instructions on
   :ref:`Development Kit Build System <Development_Kit_Build_System>`.

Performance tuning
------------------

1. Verify the optimized steering mode is configured::

        cat /sys/module/mlx4_core/parameters/log_num_mgm_entry_size

2. Use CPUs on the NUMA node to which the PCIe adapter is connected,
   for better performance. For VMs, verify that the right CPU
   and NUMA node are pinned according to the above. Run::

        lstopo-no-graphics

   to identify the NUMA node to which the PCIe adapter is connected.

3. If more than one adapter is used, and root complex capabilities allow
   putting both adapters on the same NUMA node without PCI bandwidth
   degradation, it is recommended to locate both adapters on the same NUMA
   node. This allows packets to be forwarded from one to the other without
   a NUMA performance penalty.

4. Disable pause frames::

        ethtool -A <netdev> rx off tx off

5. Verify IO non-posted prefetch is disabled by default. This can be checked
   via the BIOS configuration. Please contact your server provider for more
   information about the settings.

.. note::

   On some machines, depending on the machine integrator, it is beneficial
   to set the PCI max read request parameter to 1K. This can be
   done in the following way:

   To query the read request size use::

        setpci -s <NIC PCI address> 68.w

   If the output is different from 3XXX, set it by::

        setpci -s <NIC PCI address> 68.w=3XXX

   The XXX can be different on different systems. Make sure to configure
   according to the setpci output.

6. To minimize overhead of searching Memory Regions:

   - ``--socket-mem`` is recommended to pin a predictable amount of memory.
   - Configure a per-lcore cache when creating mempools for packet buffers.
   - Refrain from dynamically allocating/freeing memory at run time.
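
For the max read request note above, the value to write back can be derived
from the ``setpci`` query output. The following Python sketch only does the
arithmetic and does not call ``setpci``. The function name is hypothetical,
and the input ``2936`` is a made-up example, not output from a real system.

```python
def with_read_request_1k(current):
    """Return the 68.w value with its leading hex digit set to 3.

    `current` is the four-hex-digit string printed by the setpci query.
    The three remaining digits (the XXX in the note) are kept unchanged.
    """
    value = int(current, 16)
    return format((value & 0x0FFF) | 0x3000, "04x")


# Made-up query result; substitute the output seen on the target system.
print(with_read_request_1k("2936"))
```

If the query already returns a value starting with 3, the function returns
it unchanged and no write is needed.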

Usage example
-------------

This section demonstrates how to launch **testpmd** with Mellanox ConnectX-3
devices managed by librte_pmd_mlx4.

#. Load the kernel modules::

      modprobe -a ib_uverbs mlx4_en mlx4_core mlx4_ib

   Alternatively if MLNX_OFED is fully installed, the following script can
   be run::

      /etc/init.d/openibd restart

   .. note::

      User space I/O kernel modules (uio and igb_uio) are not used and do
      not have to be loaded.

#. Make sure Ethernet interfaces are in working order and linked to kernel
   verbs. Related sysfs entries should be present::

      ls -d /sys/class/net/*/device/infiniband_verbs/uverbs* | cut -d / -f 5

   Example output::

      eth2
      eth3
      eth4
      eth5

#. Optionally, retrieve their PCI bus addresses for whitelisting::

      {
          for intf in eth2 eth3 eth4 eth5;
          do
              (cd "/sys/class/net/${intf}/device/" && pwd -P);
          done;
      } |
      sed -n 's,.*/\(.*\),-w \1,p'

   Example output::

      -w 0000:83:00.0
      -w 0000:83:00.0
      -w 0000:84:00.0
      -w 0000:84:00.0

   .. note::

      There are only two distinct PCI bus addresses because the Mellanox
      ConnectX-3 adapters installed on this system are dual port.

#. Request huge pages::

      echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

#. Start testpmd with basic parameters::

      testpmd -l 8-15 -n 4 -w 0000:83:00.0 -w 0000:84:00.0 -- --rxq=2 --txq=2 -i

   Example output::

      [...]
      EAL: PCI device 0000:83:00.0 on NUMA socket 1
      EAL:   probe driver: 15b3:1007 librte_pmd_mlx4
      PMD: librte_pmd_mlx4: PCI information matches, using device "mlx4_0" (VF: false)
      PMD: librte_pmd_mlx4: 2 port(s) detected
      PMD: librte_pmd_mlx4: port 1 MAC address is 00:02:c9:b5:b7:50
      PMD: librte_pmd_mlx4: port 2 MAC address is 00:02:c9:b5:b7:51
      EAL: PCI device 0000:84:00.0 on NUMA socket 1
      EAL:   probe driver: 15b3:1007 librte_pmd_mlx4
      PMD: librte_pmd_mlx4: PCI information matches, using device "mlx4_1" (VF: false)
      PMD: librte_pmd_mlx4: 2 port(s) detected
      PMD: librte_pmd_mlx4: port 1 MAC address is 00:02:c9:b5:ba:b0
      PMD: librte_pmd_mlx4: port 2 MAC address is 00:02:c9:b5:ba:b1
      Interactive-mode selected
      Configuring Port 0 (socket 0)
      PMD: librte_pmd_mlx4: 0x867d60: TX queues number update: 0 -> 2
      PMD: librte_pmd_mlx4: 0x867d60: RX queues number update: 0 -> 2
      Port 0: 00:02:C9:B5:B7:50
      Configuring Port 1 (socket 0)
      PMD: librte_pmd_mlx4: 0x867da0: TX queues number update: 0 -> 2
      PMD: librte_pmd_mlx4: 0x867da0: RX queues number update: 0 -> 2
      Port 1: 00:02:C9:B5:B7:51
      Configuring Port 2 (socket 0)
      PMD: librte_pmd_mlx4: 0x867de0: TX queues number update: 0 -> 2
      PMD: librte_pmd_mlx4: 0x867de0: RX queues number update: 0 -> 2
      Port 2: 00:02:C9:B5:BA:B0
      Configuring Port 3 (socket 0)
      PMD: librte_pmd_mlx4: 0x867e20: TX queues number update: 0 -> 2
      PMD: librte_pmd_mlx4: 0x867e20: RX queues number update: 0 -> 2
      Port 3: 00:02:C9:B5:BA:B1
      Checking link statuses...
      Port 0 Link Up - speed 10000 Mbps - full-duplex
      Port 1 Link Up - speed 40000 Mbps - full-duplex
      Port 2 Link Up - speed 10000 Mbps - full-duplex
      Port 3 Link Up - speed 40000 Mbps - full-duplex
      Done
      testpmd>
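
The whitelisting step above prints each PCI bus address once per port, while
the testpmd command line lists each address once. A minimal Python sketch of
that de-duplication is shown below. The function name is hypothetical, and
the input list is copied from the example output above.

```python
def whitelist_args(pci_addresses):
    """Build -w arguments, one per distinct address, in first-seen order."""
    args = []
    # dict.fromkeys() drops repeats while preserving insertion order.
    for address in dict.fromkeys(pci_addresses):
        args += ["-w", address]
    return args


addresses = ["0000:83:00.0", "0000:83:00.0", "0000:84:00.0", "0000:84:00.0"]
print(" ".join(whitelist_args(addresses)))
```

Keeping the first-seen order makes the resulting port numbering easier to
relate back to the interface list.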