..  BSD LICENSE
    Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in
      the documentation and/or other materials provided with the
      distribution.
    * Neither the name of Intel Corporation nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Load Balancer Sample Application
================================

The Load Balancer sample application demonstrates the concept of isolating the packet I/O task
from the application-specific workload.
Depending on the performance target,
a number of logical cores (lcores) are dedicated to handling the interaction with the NIC ports (I/O lcores),
while the rest of the lcores are dedicated to performing the application processing (worker lcores).
The worker lcores are totally oblivious to the intricacies of the packet I/O activity and
use the NIC-agnostic interface provided by software rings to exchange packets with the I/O lcores.

Overview
--------

The architecture of the Load Balancer application is presented in the following figure.

.. _figure_load_bal_app_arch:

.. figure:: img/load_bal_app_arch.*

   Load Balancer Application Architecture


For the sake of simplicity, the diagram illustrates a specific case of two I/O RX and two I/O TX lcores offloading the packet I/O
overhead incurred by four NIC ports from four worker lcores, with each I/O lcore handling RX/TX for two NIC ports.

I/O RX Logical Cores
~~~~~~~~~~~~~~~~~~~~

Each I/O RX lcore performs packet RX from its assigned NIC RX rings and then distributes the received packets to the worker lcores.
The application allows each I/O RX lcore to communicate with any of the worker lcores,
therefore each (I/O RX lcore, worker lcore) pair is connected through a dedicated single-producer, single-consumer software ring.

The worker lcore to handle the current packet is determined by reading a predefined 1-byte field from the input packet::

    worker_id = packet[load_balancing_field] % n_workers

Since all the packets that are part of the same traffic flow are expected to have the same value for the load balancing field,
this scheme also ensures that all the packets of a given traffic flow are directed to the same worker lcore (flow affinity)
in the same order they enter the system (packet ordering).

I/O TX Logical Cores
~~~~~~~~~~~~~~~~~~~~

Each I/O TX lcore owns the packet TX for a predefined set of NIC ports. To enable each worker lcore to send packets to any NIC TX port,
the application creates a software ring for each (worker lcore, NIC TX port) pair,
with each I/O TX lcore servicing the software rings associated with the NIC ports it owns.

Worker Logical Cores
~~~~~~~~~~~~~~~~~~~~

Each worker lcore reads packets from its set of input software rings and
routes them to the NIC ports for transmission by dispatching them to the output software rings.
The routing logic is LPM (Longest Prefix Match) based, with all the worker lcores sharing the same LPM rules.

Compiling the Application
-------------------------

The sequence of steps used to build the application is:

#. Export the required environment variables:

   .. code-block:: console

       export RTE_SDK=<Path to the DPDK installation folder>
       export RTE_TARGET=x86_64-native-linuxapp-gcc

#. Build the application executable file:

   .. code-block:: console

       cd ${RTE_SDK}/examples/load_balancer
       make

For more details on how to build the DPDK libraries and sample applications,
please refer to the *DPDK Getting Started Guide*.

Running the Application
-----------------------

To successfully run the application,
the command line used to start the application has to be in sync with the traffic flows configured on the traffic generator side.

For examples of application command lines and traffic generator flows, please refer to the DPDK Test Report.
For more details on how to set up and run the sample applications provided with the DPDK package,
please refer to the *DPDK Getting Started Guide*.

Explanation
-----------

Application Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~

The application run-time configuration is done through the application command line parameters.
Any parameter that is not specified as mandatory is optional,
with the default value hard-coded in the main.h header file from the application folder.

The application command line parameters are listed below:

#. --rx "(PORT, QUEUE, LCORE), ...": The list of NIC RX ports and queues handled by the I/O RX lcores.
   This parameter also implicitly defines the list of I/O RX lcores. This is a mandatory parameter.

#. --tx "(PORT, LCORE), ...": The list of NIC TX ports handled by the I/O TX lcores.
   This parameter also implicitly defines the list of I/O TX lcores.
   This is a mandatory parameter.

#. --w "LCORE, ...": The list of the worker lcores. This is a mandatory parameter.

#. --lpm "IP / PREFIX => PORT; ...": The list of LPM rules used by the worker lcores for packet forwarding.
   This is a mandatory parameter.

#. --rsz "A, B, C, D": Ring sizes (see the illustrative command after this list):

   #. A = The size (in number of buffer descriptors) of each of the NIC RX rings read by the I/O RX lcores.

   #. B = The size (in number of elements) of each of the software rings used by the I/O RX lcores to send packets to worker lcores.

   #. C = The size (in number of elements) of each of the software rings used by the worker lcores to send packets to I/O TX lcores.

   #. D = The size (in number of buffer descriptors) of each of the NIC TX rings written by I/O TX lcores.

#. --bsz "(A, B), (C, D), (E, F)": Burst sizes:

   #. A = The I/O RX lcore read burst size from NIC RX.

   #. B = The I/O RX lcore write burst size to the output software rings.

   #. C = The worker lcore read burst size from the input software rings.

   #. D = The worker lcore write burst size to the output software rings.

   #. E = The I/O TX lcore read burst size from the input software rings.

   #. F = The I/O TX lcore write burst size to the NIC TX.

#. --pos-lb POS: The position of the 1-byte field within the input packet used by the I/O RX lcores
   to identify the worker lcore for the current packet.
   This field needs to be within the first 64 bytes of the input packet.

The infrastructure of software rings connecting the I/O lcores and the worker lcores is built by the application
based on the configuration provided by the user through the application command line parameters.

A specific lcore performing the I/O RX role for a specific set of NIC ports can also perform the I/O TX role
for the same or a different set of NIC ports.
However, a specific lcore cannot perform both the I/O role (either RX or TX) and the worker role during the same session.

Example:

.. code-block:: console

    ./load_balancer -c 0xf8 -n 4 -- --rx "(0,0,3),(1,0,3)" --tx "(0,3),(1,3)" --w "4,5,6,7" --lpm "1.0.0.0/24=>0; 1.0.1.0/24=>1;" --pos-lb 29

There is a single I/O lcore (lcore 3) that handles RX and TX for two NIC ports (ports 0 and 1).
It exchanges packets with four worker lcores (lcores 4, 5, 6 and 7),
which are assigned worker IDs 0 to 3 (lcore 4 has worker ID 0, lcore 5 has worker ID 1, and so on).

Assuming that all the input packets are IPv4 packets with no VLAN label and the source IP address of the current packet is A.B.C.D,
the worker lcore for the current packet is determined by byte D, which sits at offset 29 within the packet.
There are two LPM rules that are used by each worker lcore to route packets to the output NIC ports.

The following table illustrates the packet flow through the system for several possible traffic flows:

+------------+----------------+-----------------+------------------------------+--------------+
| **Flow #** | **Source**     | **Destination** | **Worker ID (Worker lcore)** | **Output**   |
|            | **IP Address** | **IP Address**  |                              | **NIC Port** |
+============+================+=================+==============================+==============+
| 1          | 0.0.0.0        | 1.0.0.1         | 0 (4)                        | 0            |
+------------+----------------+-----------------+------------------------------+--------------+
| 2          | 0.0.0.1        | 1.0.1.2         | 1 (5)                        | 1            |
+------------+----------------+-----------------+------------------------------+--------------+
| 3          | 0.0.0.14       | 1.0.0.3         | 2 (6)                        | 0            |
+------------+----------------+-----------------+------------------------------+--------------+
| 4          | 0.0.0.15       | 1.0.1.4         | 3 (7)                        | 1            |
+------------+----------------+-----------------+------------------------------+--------------+

NUMA Support
~~~~~~~~~~~~

The application has built-in performance enhancements for the NUMA case:

#. One buffer pool per CPU socket (see the sketch after this list).

#. One LPM table per CPU socket.

#. Memory for the NIC RX or TX rings is allocated on the same socket as the lcore handling the respective ring.

When multiple CPU sockets are used in the system,
it is recommended to enable, for each CPU socket, at least one lcore to fulfill the I/O role for the NIC ports that
are directly attached to that CPU socket through the PCI Express* bus.
It is always recommended to handle the packet I/O with lcores from the same CPU socket as the NICs.

Depending on whether the I/O RX lcore (same CPU socket as NIC RX),
the worker lcore and the I/O TX lcore (same CPU socket as NIC TX) handling a specific input packet
are on the same or different CPU sockets, the following run-time scenarios are possible:

#. AAA: The packet is received, processed and transmitted without going across CPU sockets.

#. AAB: The packet is received and processed on socket A,
   but as it has to be transmitted on a NIC port connected to socket B,
   the packet is sent to socket B through software rings.

#. ABB: The packet is received on socket A, but as it has to be processed by a worker lcore on socket B,
   the packet is sent to socket B through software rings.
   The packet is transmitted by a NIC port connected to the same CPU socket as the worker lcore that processed it.

#. ABC: The packet is received on socket A, processed by a worker lcore on socket B,
   and then transmitted by a NIC port connected to socket C.
   The performance price for crossing the CPU socket boundary is paid twice for this packet.