..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

What does "EAL: map_all_hugepages(): open failed: Permission denied Cannot init memory" mean?
---------------------------------------------------------------------------------------------

This is most likely due to the test application not being run with sudo to promote the user to a superuser.
Alternatively, applications can also be run as a regular user.
For more information, please refer to the :ref:`DPDK Getting Started Guide <linux_gsg>`.


If I want to change the number of hugepages allocated, how do I remove the original pages allocated?
----------------------------------------------------------------------------------------------------

The number of pages allocated can be seen by executing the following command::

   grep Huge /proc/meminfo
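
On a system where 1024 2 MB pages have been reserved, the output looks similar to the following
(the values shown here are illustrative)::

   HugePages_Total:    1024
   HugePages_Free:     1024
   HugePages_Rsvd:        0
   HugePages_Surp:        0
   Hugepagesize:       2048 kB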

Once all the pages are mmapped by an application, they stay that way.
If you start a test application with less than the maximum, then you have free pages.
When you stop and restart the test application, it looks to see if the pages are available in the ``/dev/huge`` directory and mmaps them.
If you look in the directory, you will see ``n`` 2 MB page files. If you specified 1024 pages, you will see 1024 page files.
These are then placed in memory segments to get contiguous memory.

If you need to change the number of pages, it is easier to first remove the pages.
The usertools/dpdk-setup.sh script provides an option to do this.
See the "Quick Start Setup Script" section in the :ref:`DPDK Getting Started Guide <linux_gsg>` for more information.
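
The reservation can also be cleared and re-created by hand through the kernel's sysfs interface.
This is a minimal sketch, assuming 2 MB pages, a two-socket system, and root privileges:

.. code-block:: console

    # Release the current hugepage reservation.
    echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    # Reserve 512 pages on each NUMA node.
    echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    echo 512 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages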


If I execute "l2fwd -l 0-3 -m 64 -n 3 -- -p 3", I get the following output, indicating that there are no socket 0 hugepages to allocate the mbuf and ring structures to?
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I have set up a total of 1024 hugepages (that is, 512 2 MB pages on each NUMA node).

The ``-m`` command line parameter does not guarantee that huge pages will be reserved on specific sockets, so the allocated huge pages may not all be on socket 0.
To request memory to be reserved on a specific socket, please use the ``--socket-mem`` command-line parameter instead of ``-m``.
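
For example, the command above could be rewritten as follows to reserve 64 MB on each of the two sockets
(the application arguments after the ``--`` separator are unchanged):

.. code-block:: console

    ./l2fwd -l 0-3 -n 3 --socket-mem=64,64 -- -p 3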


I am running a 32-bit DPDK application on a NUMA system, and sometimes the application initializes fine but cannot allocate memory. Why is that happening?
----------------------------------------------------------------------------------------------------------------------------------------------------------

32-bit applications have limitations in terms of how much virtual memory is available, hence the amount of hugepage memory they are able to allocate is also limited (to about 1 GB).
If your system has a lot (more than 1 GB) of hugepage memory, not all of it will be allocated.
Due to hugepages typically being allocated on a local NUMA node, the hugepage allocation the application gets during initialization depends on which
NUMA node it is running on (the EAL does not affinitize cores until much later in the initialization process).
Sometimes, the Linux OS runs the DPDK application on a core that is located on a different NUMA node from the DPDK master core and
therefore all the hugepages are allocated on the wrong socket.

To avoid this scenario, either lower the amount of hugepage memory available to 1 GB (or less), or run the application with taskset,
affinitizing it to the would-be master core.

For example, if your EAL coremask is 0xff0, the master core will usually be the first core in the coremask (0x10); this is what you have to supply to taskset::

   taskset 0x10 ./l2fwd -l 4-11 -n 2

.. Note: Instead of '-c 0xff0', use '-l 4-11' as a cleaner way to define lcores.

In this way, the hugepages have a greater chance of being allocated to the correct socket.
Additionally, the ``--socket-mem`` option could be used to ensure the availability of memory for each socket, so that if hugepages were allocated on
the wrong socket, the application simply will not start.


On application startup, there is a lot of EAL information printed. Is there any way to reduce this?
---------------------------------------------------------------------------------------------------

Yes, the option ``--log-level=`` accepts symbolic names (or numbers):

1. emergency
2. alert
3. critical
4. error
5. warning
6. notice
7. info
8. debug
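
For example, to show only warnings and more severe messages, a command line such as the following could be
used (testpmd is shown purely for illustration):

.. code-block:: console

    ./testpmd -l 0-3 -n 4 --log-level=warning -- -i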

How can I tune my network application to achieve lower latency?
---------------------------------------------------------------

Traditionally, there is a trade-off between throughput and latency. An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet typically increases as a result.
Similarly, the application can be tuned to have, on average, a low end-to-end latency at the cost of lower throughput.

To achieve higher throughput, the DPDK attempts to amortize the cost of processing each packet individually by processing packets in bursts.
Using the testpmd application as an example, the "burst" size can be set on the command line to a value of 32 (also the default value).
This allows the application to request 32 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received, in this case, all 32 packets.
The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because the cost of tail pointer updates to both the RX and TX queues
can be spread across 32 packets, effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.

However, this is not very desirable when tuning for low latency, because the first packet that was received must also wait for the other 31 packets to be received.
It cannot be transmitted until the other 31 packets have also been processed because the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 32 packets have been processed for transmission.

To consistently achieve low latency even under heavy system load, the application developer should avoid processing packets in batches.
The testpmd application can be configured from the command line to use a burst value of 1.
This allows a single packet to be processed at a time, providing lower latency, but with the added cost of lower throughput.
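
For example, testpmd accepts the burst size as an application argument after the ``--`` separator:

.. code-block:: console

    ./testpmd -l 0-3 -n 4 -- --burst=1 -i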


Without NUMA enabled, my network throughput is low, why?
--------------------------------------------------------

I have dual Intel® Xeon® E5645 processors at 2.40 GHz with four Intel® 82599 10 Gigabit Ethernet NICs,
using eight logical cores on each processor, with RSS set to distribute the network load from two 10 GbE interfaces to the cores on each processor.

Without NUMA enabled, memory is allocated from both sockets, since memory is interleaved.
Therefore, each 64B chunk is interleaved across both memory domains.

The first 64B chunk is mapped to node 0, the second 64B chunk is mapped to node 1, the third to node 0, the fourth to node 1.
If you allocated 256B, you would get memory that looks like this:

.. code-block:: console

    256B buffer
    Offset 0x00 - Node 0
    Offset 0x40 - Node 1
    Offset 0x80 - Node 0
    Offset 0xc0 - Node 1

Therefore, packet buffers and descriptor rings are allocated from both memory domains, incurring QPI bandwidth when accessing the remote memory domain, and much higher latency.
For best performance with NUMA disabled, only one socket should be populated.
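
To confirm how memory is laid out across nodes, the ``numactl`` utility (assuming it is installed) can be
used to inspect the topology and, if needed, to bind an application's memory and CPUs to a single node:

.. code-block:: console

    # Show the NUMA topology of the system.
    numactl --hardware

    # Run an application with memory and cores restricted to node 0.
    numactl --membind=0 --cpunodebind=0 ./l2fwd -l 0-7 -n 3 -- -p 3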


I am getting errors about not being able to open files. Why?
------------------------------------------------------------

As the DPDK operates, it opens a lot of files, which can result in reaching the open file limit, set using the ulimit command or in the limits.conf file.
This is especially true when using a large number (>512) of 2 MB huge pages. Please increase the open file limit if your application is not able to open files.
This can be done either by issuing a ulimit command or by editing the limits.conf file. Please consult Linux manpages for usage information.
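
For example, the limit can be raised for the current shell, or persistently for a given user
(the user name and the value 4096 below are only placeholders):

.. code-block:: console

    # Raise the per-process open file limit for the current shell.
    ulimit -n 4096

    # Or add a line like this to /etc/security/limits.conf for a persistent change:
    # <username>  hard  nofile  4096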


VF driver for IXGBE devices cannot be initialized
-------------------------------------------------

Some versions of the Linux IXGBE driver do not assign a random MAC address to VF devices at initialization.
In this case, this has to be done manually on the VM host, using the following command:

.. code-block:: console

    ip link set <interface> vf <VF function> mac <MAC address>

where ``<interface>`` is the interface providing the virtual functions (for example, eth0), ``<VF function>`` is the virtual function number (for example, 0),
and ``<MAC address>`` is the desired MAC address.
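
For instance, with a purely illustrative interface name and MAC address:

.. code-block:: console

    ip link set eth0 vf 0 mac 00:11:22:33:44:55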


Is it safe to add an entry to the hash table while running?
------------------------------------------------------------

Currently the hash table implementation is not thread safe, and it assumes that locking between threads and processes is handled by the user's application.
This is likely to be supported in future releases.


What is the purpose of setting iommu=pt?
----------------------------------------

DPDK uses a 1:1 mapping and does not support IOMMU. IOMMU allows for simpler VM physical address translation.
The second role of the IOMMU is to allow protection from unwanted memory access by an unsafe device that has DMA privileges.
Unfortunately, the protection comes with an extremely high performance cost for high speed NICs.

Setting ``iommu=pt`` disables IOMMU support for the hypervisor.
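
On a GRUB-based system, the option is typically added to the kernel command line, commonly together with
``intel_iommu=on`` (the exact set of options depends on your platform and use case):

.. code-block:: console

    # In /etc/default/grub, extend the kernel command line, then regenerate the
    # GRUB configuration and reboot.
    GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt"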


When trying to send packets from an application to itself, meaning smac==dmac, using Intel(R) 82599 VF packets are lost.
------------------------------------------------------------------------------------------------------------------------

Check the ``LLE(PFVMTXSSW[n])`` register, which allows an individual pool to send traffic and have it looped back to itself.


Can I split packet RX to use DPDK and have an application's higher order functions continue using Linux pthread?
----------------------------------------------------------------------------------------------------------------

The DPDK's lcore threads are Linux pthreads bound onto specific cores. Configure the DPDK to do work on the same
cores and run the application's other work on other cores, using the DPDK's "coremask" setting to specify which
cores it should launch itself on.
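
As an illustrative sketch, an application could be given cores 0-3 for its DPDK lcores, leaving the
remaining cores free for ordinary pthreads (``my_app`` is a placeholder name):

.. code-block:: console

    # DPDK work runs on lcores 0-3; other threads can be placed on cores 4 and up,
    # e.g. with taskset or pthread_setaffinity_np().
    ./my_app -l 0-3 -n 4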


Is it possible to exchange data between DPDK processes and regular userspace processes via some shared memory or IPC mechanism?
-------------------------------------------------------------------------------------------------------------------------------

Yes - DPDK processes are regular Linux/BSD processes, and can use all OS provided IPC mechanisms.


Can the multiple queues in Intel(R) I350 be used with DPDK?
-----------------------------------------------------------

The I350 has RSS support and 8 queue pairs can be used in RSS mode. It should work with multi-queue DPDK applications using RSS.


How can hugepage-backed memory be shared among multiple processes?
------------------------------------------------------------------

See the Primary and Secondary examples in the :ref:`multi-process sample application <multi_process_app>`.


Why can't my application receive packets on my system with UEFI Secure Boot enabled?
------------------------------------------------------------------------------------

If UEFI secure boot is enabled, the Linux kernel may disallow the use of UIO on the system.
Therefore, devices for use by DPDK should be bound to the ``vfio-pci`` kernel module rather than ``igb_uio`` or ``uio_pci_generic``.
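
For example, assuming the NIC's PCI address is 0000:01:00.0, the device can be bound with the devbind
script shipped in usertools:

.. code-block:: console

    # Load the vfio-pci module and bind the device to it.
    modprobe vfio-pci
    ./usertools/dpdk-devbind.py --bind=vfio-pci 0000:01:00.0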