..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2015 Intel Corporation.

Performance Thread Sample Application
=====================================

The performance thread sample application is a derivative of the standard L3
forwarding application that demonstrates different threading models.

Overview
--------

For a general description of the L3 forwarding application's capabilities
please refer to the documentation of the standard application in
:doc:`l3_forward`.

The performance thread sample application differs from the standard L3
forwarding example in that it divides the TX and RX processing between
different threads, and makes it possible to assign individual threads to
different cores.

Three threading models are considered:

#. When there is one EAL thread per physical core.
#. When there are multiple EAL threads per physical core.
#. When there are multiple lightweight threads per EAL thread.

Since DPDK release 2.0 it is possible to launch applications using the
``--lcores`` EAL parameter, specifying cpu-sets for a physical core. With the
performance thread sample application it is now also possible to assign
individual RX and TX functions to different cores.

As an alternative to dividing the L3 forwarding work between different EAL
threads the performance thread sample introduces the possibility to run the
application threads as lightweight threads (L-threads) within one or
more EAL threads.

In order to facilitate this threading model the example includes a primitive
cooperative scheduler (L-thread) subsystem. More details of the L-thread
subsystem can be found in :ref:`lthread_subsystem`.

**Note:** Whilst theoretically possible, it is not anticipated that multiple
L-thread schedulers would be run on the same physical core. This mode of
operation should not be expected to yield useful performance and is considered
invalid.

Compiling the Application
-------------------------

To compile the sample application see :doc:`compiling`.

The application is located in the ``performance-thread/l3fwd-thread`` sub-directory.

Running the Application
-----------------------

The application has a number of command line options::

    ./build/l3fwd-thread [EAL options] --
        -p PORTMASK [-P]
        --rx(port,queue,lcore,thread)[,(port,queue,lcore,thread)]
        --tx(lcore,thread)[,(lcore,thread)]
        [--enable-jumbo [--max-pkt-len PKTLEN]] [--no-numa]
        [--hash-entry-num] [--ipv6] [--no-lthreads] [--stat-lcore lcore]
        [--parse-ptype]

Where:

* ``-p PORTMASK``: Hexadecimal bitmask of ports to configure.

* ``-P``: optional, sets all ports to promiscuous mode so that packets are
  accepted regardless of the packet's Ethernet MAC destination address.
  Without this option, only packets with the Ethernet MAC destination address
  set to the Ethernet address of the port are accepted.

* ``--rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]``: the list of
  NIC RX ports and queues handled by the RX lcores and threads. The parameters
  are explained below.

* ``--tx (lcore,thread)[,(lcore,thread)]``: the list of TX threads identifying
  the lcore the thread runs on, and the id of the RX thread with which it is
  associated. The parameters are explained below.

* ``--enable-jumbo``: optional, enables jumbo frames.

* ``--max-pkt-len``: optional, maximum packet length in decimal (64-9600).

* ``--no-numa``: optional, disables NUMA awareness.

* ``--hash-entry-num``: optional, specifies the number of hash entries (in hex)
  to set up.

* ``--ipv6``: optional, set it if running IPv6 packets.

* ``--no-lthreads``: optional, disables the L-thread model and uses the EAL
  threading model. See below.

* ``--stat-lcore``: optional, run the CPU load stats collector on the specified
  lcore.

* ``--parse-ptype``: optional, set to use software to analyze packet type.
  Without this option, hardware will check the packet type.

The parameters of the ``--rx`` and ``--tx`` options are:

* ``--rx`` parameters

  .. _table_l3fwd_rx_parameters:

  +--------+------------------------------------------------------+
  | port   | RX port                                              |
  +--------+------------------------------------------------------+
  | queue  | RX queue that will be read on the specified RX port  |
  +--------+------------------------------------------------------+
  | lcore  | Core to use for the thread                           |
  +--------+------------------------------------------------------+
  | thread | Thread id (numbered contiguously from 0 to N)        |
  +--------+------------------------------------------------------+


* ``--tx`` parameters

  .. _table_l3fwd_tx_parameters:

  +--------+------------------------------------------------------+
  | lcore  | Core to use for L3 route match and transmit          |
  +--------+------------------------------------------------------+
  | thread | Id of RX thread to be associated with this TX thread |
  +--------+------------------------------------------------------+

The ``l3fwd-thread`` application allows you to start packet processing in two
threading models: L-Threads (default) and EAL Threads (when the
``--no-lthreads`` parameter is used). For consistency all parameters are used
in the same way for both models.


Running with L-threads
~~~~~~~~~~~~~~~~~~~~~~

When the L-thread model is used (default option), lcore and thread parameters
in ``--rx/--tx`` are used to affinitize threads to the selected scheduler.

For example, the following places every l-thread on different lcores::

    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,1,1)" \
        --tx="(2,0)(3,1)"

The following places RX l-threads on lcore 0 and TX l-threads on lcore 1 and 2
and so on::

    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,0,1)" \
        --tx="(1,0)(2,1)"


Running with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~

When the ``--no-lthreads`` parameter is used, the L-threading model is turned
off and EAL threads are used for all processing. EAL threads are enumerated in
the same way as L-threads, but the ``--lcores`` EAL parameter is used to
affinitize threads to the selected cpu-set (scheduler). Thus it is possible to
place every RX and TX thread on different lcores.

For example, the following places every EAL thread on different lcores::

    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,1,1)" \
        --tx="(2,0)(3,1)" \
        --no-lthreads


To affinitize two or more EAL threads to one cpu-set, the EAL ``--lcores``
parameter is used.

The following places RX EAL threads on lcore 0 and TX EAL threads on lcore 1
and 2 and so on::

    l3fwd-thread -l 0-7 -n 2 --lcores="(0,1)@0,(2,3)@1" -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,1,1)" \
        --tx="(2,0)(3,1)" \
        --no-lthreads


Examples
~~~~~~~~

For selected scenarios, the L-thread command line configuration of the
application and its EAL thread equivalent can be realized as follows:

a) Start every thread on a different scheduler (1:1)::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,1,1)" \
           --tx="(2,0)(3,1)"

   EAL thread equivalent::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,1,1)" \
           --tx="(2,0)(3,1)" \
           --no-lthreads

b) Start all threads on one core (N:1).

   Start 4 L-threads on lcore 0::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,0,1)" \
           --tx="(0,0)(0,1)"

   Start 4 EAL threads on cpu-set 0::

       l3fwd-thread -l 0-7 -n 2 --lcores="(0-3)@0" -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,0,1)" \
           --tx="(2,0)(3,1)" \
           --no-lthreads

c) Start threads on different cores (N:M).

   Start 2 L-threads for RX on lcore 0, and 2 L-threads for TX on lcore 1::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,0,1)" \
           --tx="(1,0)(1,1)"

   Start 2 EAL threads for RX on cpu-set 0, and 2 EAL threads for TX on
   cpu-set 1::

       l3fwd-thread -l 0-7 -n 2 --lcores="(0-1)@0,(2-3)@1" -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,1,1)" \
           --tx="(2,0)(3,1)" \
           --no-lthreads

Explanation
-----------

To a great extent the sample application differs little from the standard L3
forwarding application, and readers are advised to familiarize themselves with
the material covered in the :doc:`l3_forward` documentation before proceeding.

The following explanation is focused on the way threading is handled in the
performance thread example.


Mode of operation with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance thread sample application has split the RX and TX functionality
into two different threads, and the RX and TX threads are
interconnected via software rings. With respect to these rings the RX threads
are producers and the TX threads are consumers.

On initialization the TX and RX threads are started according to the command
line parameters.

The RX threads poll the network interface queues and post received packets to a
TX thread via a corresponding software ring.

The TX threads poll software rings, perform the L3 forwarding hash/LPM match,
and assemble packet bursts before performing burst transmit on the network
interface.
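
The producer/consumer relationship described above can be illustrated with a
minimal single-producer/single-consumer ring. This is a toy stand-in for
DPDK's ``rte_ring``, which the application actually uses; the struct and
function names below are illustrative only:

```c
/* Minimal single-producer/single-consumer ring, illustrating the
 * producer/consumer pattern the RX and TX threads use.  The real
 * application uses DPDK's rte_ring; this toy version only shows the
 * head/tail index mechanics. */
#include <stddef.h>

#define RING_SIZE 8              /* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

struct ring {
    void *slots[RING_SIZE];
    unsigned int head;           /* next slot the producer writes */
    unsigned int tail;           /* next slot the consumer reads */
};

/* Enqueue one packet pointer; returns 0 on success, -1 if the ring is full. */
static int ring_enqueue(struct ring *r, void *pkt)
{
    if (r->head - r->tail == RING_SIZE)
        return -1;               /* full */
    r->slots[r->head & RING_MASK] = pkt;
    r->head++;
    return 0;
}

/* Dequeue one packet pointer; returns NULL if the ring is empty. */
static void *ring_dequeue(struct ring *r)
{
    void *pkt;

    if (r->head == r->tail)
        return NULL;             /* empty */
    pkt = r->slots[r->tail & RING_MASK];
    r->tail++;
    return pkt;
}
```

A ring shared between cores additionally needs the appropriate memory-ordering
guarantees on the head and tail indices, which ``rte_ring`` provides.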

As with the standard L3 forward application, burst draining of residual packets
is performed periodically, with the period calculated from elapsed time using
the timestamp counter (TSC).
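
A sketch of that drain logic, with the TSC read (``rte_rdtsc()`` in DPDK)
replaced by a caller-supplied cycle count so the behavior can be shown
deterministically; the period value, struct, and function names are
illustrative:

```c
/* Sketch of TSC-based periodic drain logic.  Real code would read the
 * timestamp counter with rte_rdtsc(); here `now` is passed in so the
 * logic can be exercised deterministically. */
#include <stdint.h>

#define DRAIN_PERIOD_CYCLES 1000000  /* illustrative period */

struct tx_state {
    uint64_t prev_tsc;        /* TSC value at the last drain */
    unsigned int pending;     /* packets buffered but not yet transmitted */
    unsigned int drained;     /* total packets flushed by the drain path */
};

/* Called once per main-loop iteration: flush the buffer if at least
 * DRAIN_PERIOD_CYCLES have elapsed since the previous drain. */
static void maybe_drain(struct tx_state *tx, uint64_t now)
{
    if (now - tx->prev_tsc >= DRAIN_PERIOD_CYCLES) {
        tx->drained += tx->pending;  /* stand-in for a burst transmit */
        tx->pending = 0;
        tx->prev_tsc = now;
    }
}
```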

The diagram below illustrates a case with two RX threads and three TX threads.

.. _figure_performance_thread_1:

.. figure:: img/performance_thread_1.*

Mode of operation with L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Like the EAL thread configuration the application has split the RX and TX
functionality into different threads, and the pairs of RX and TX threads are
interconnected via software rings.

On initialization an L-thread scheduler is started on every EAL thread. On all
but the master EAL thread only a dummy L-thread is initially started.
The L-thread started on the master EAL thread then spawns other L-threads on
different L-thread schedulers according to the command line parameters.

The RX threads poll the network interface queues and post received packets
to a TX thread via the corresponding software ring.

The ring interface is augmented by means of an L-thread condition variable that
enables the TX thread to be suspended when the TX ring is empty. The RX thread
signals the condition whenever it posts to the TX ring, causing the TX thread
to be resumed.
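
This suspend/resume handshake can be sketched with POSIX primitives standing
in for ``lthread_cond_wait()``/``lthread_cond_signal()``; the L-thread
versions have similar semantics but suspend the caller in the cooperative
scheduler rather than blocking in the kernel. The counters and item count
below are illustrative:

```c
/* Sketch of the RX->TX handshake: the consumer (TX) sleeps on a condition
 * variable while the ring is empty; the producer (RX) signals after each
 * post.  pthread primitives stand in for the L-thread equivalents. */
#include <pthread.h>

#define N_ITEMS 100

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ring_nonempty = PTHREAD_COND_INITIALIZER;
static int ring_count;      /* packets posted but not yet consumed */
static int consumed;        /* total packets handled by the TX side */

static void *rx_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_ITEMS; i++) {
        pthread_mutex_lock(&lock);
        ring_count++;                        /* post to the TX ring */
        pthread_cond_signal(&ring_nonempty); /* resume the TX thread */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *tx_thread(void *arg)
{
    (void)arg;
    while (consumed < N_ITEMS) {
        pthread_mutex_lock(&lock);
        while (ring_count == 0)              /* suspend while ring is empty */
            pthread_cond_wait(&ring_nonempty, &lock);
        ring_count--;                        /* consume one packet */
        consumed++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
```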

Additionally the TX L-thread spawns a worker L-thread to take care of
polling the software rings, whilst it handles burst draining of the transmit
buffer.

The worker threads poll the software rings, perform L3 route lookup and
assemble packet bursts. If the TX ring is empty the worker thread suspends
itself by waiting on the condition variable associated with the ring.

Burst draining of residual packets, less than the burst size, is performed by
the TX thread which sleeps (using an L-thread sleep function) and resumes
periodically to flush the TX buffer.

This design means that L-threads that have no work can yield the CPU to other
L-threads and avoid having to constantly poll the software rings.

The diagram below illustrates a case with two RX threads and three TX functions
(each comprising a thread that processes forwarding and a thread that
periodically drains the output buffer of residual packets).

.. _figure_performance_thread_2:

.. figure:: img/performance_thread_2.*


CPU load statistics
~~~~~~~~~~~~~~~~~~~

It is possible to display statistics showing estimated CPU load on each core.
The statistics indicate the percentage of CPU time spent: processing
received packets (forwarding), polling queues/rings (waiting for work),
and doing any other processing (context switch and other overhead).

When enabled, statistics are gathered by having the application threads set and
clear flags when they enter and exit pertinent code sections. The flags are
then sampled in real time by a statistics collector thread running on another
core. This thread displays the data in real time on the console.

This feature is enabled by designating a statistics collector core, using the
``--stat-lcore`` parameter.
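
The sampling arithmetic amounts to counting how often each flag state is
observed; a minimal sketch of that accounting follows (the state names,
struct, and percentage math are illustrative, not the application's actual
implementation):

```c
/* Sketch of the load-statistics accounting: the collector samples a
 * per-core flag word and tallies which state each sample fell into. */
enum core_state { STAT_IDLE = 0, STAT_POLL = 1, STAT_WORK = 2 };

struct load_stats {
    unsigned int counts[3];  /* samples observed in each state */
    unsigned int total;      /* total samples taken */
};

/* Called by the collector each time it samples a core's flags. */
static void stats_sample(struct load_stats *s, enum core_state state)
{
    s->counts[state]++;
    s->total++;
}

/* Estimated percentage of CPU time spent in the given state. */
static unsigned int stats_pct(const struct load_stats *s, enum core_state state)
{
    return s->total ? (100 * s->counts[state]) / s->total : 0;
}
```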


.. _lthread_subsystem:

The L-thread subsystem
----------------------

The L-thread subsystem resides in the ``examples/performance-thread/common``
directory and is built and linked automatically when building the
``l3fwd-thread`` example.

The subsystem provides a simple cooperative scheduler to enable arbitrary
functions to run as cooperative threads within a single EAL thread.
The subsystem provides a pthread-like API that is intended to assist in
the reuse of legacy code written for POSIX pthreads.

The following sections provide some detail on the features, constraints,
performance and porting considerations when using L-threads.


.. _comparison_between_lthreads_and_pthreads:

Comparison between L-threads and POSIX pthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The fundamental difference between the L-thread and pthread models is the
way in which threads are scheduled. The simplest way to think about this is to
consider the case of a processor with a single CPU. To run multiple threads
on a single CPU, the scheduler must frequently switch between the threads,
in order that each thread is able to make timely progress.
This is the basis of any multitasking operating system.

This section explores the differences between the pthread model and the
L-thread model as implemented in the provided L-thread subsystem. If needed, a
theoretical discussion of preemptive vs cooperative multi-threading can be
found in any good text on operating system design.


Scheduling and context switching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The POSIX pthread library provides an application programming interface to
create and synchronize threads. Scheduling policy is determined by the host OS,
and may be configurable. The OS may use sophisticated rules to determine which
thread should be run next, threads may suspend themselves or make other threads
ready, and the scheduler may employ a time slice giving each thread a maximum
time quantum after which it will be preempted in favor of another thread that
is ready to run. To complicate matters further threads may be assigned
different scheduling priorities.

By contrast the L-thread subsystem is considerably simpler. Logically the
L-thread scheduler performs the same multiplexing function for L-threads
within a single pthread as the OS scheduler does for pthreads within an
application process. The L-thread scheduler is simply the main loop of a
pthread, and in so far as the host OS is concerned it is a regular pthread
just like any other. The host OS is oblivious to the existence of L-threads
and is not at all involved in their scheduling.

The other and most significant difference between the two models is that
L-threads are scheduled cooperatively. L-threads cannot preempt each
other, nor can the L-thread scheduler preempt a running L-thread (i.e.
there is no time slicing). The consequence is that programs implemented with
L-threads must possess frequent rescheduling points, meaning that they must
explicitly and of their own volition return to the scheduler at frequent
intervals, in order to allow other L-threads an opportunity to proceed.

In both models switching between threads requires that the current CPU
context is saved and a new context (belonging to the next thread ready to run)
is restored. With pthreads this context switching is handled transparently,
and the set of CPU registers that must be preserved between context switches
is as per an interrupt handler.

An L-thread context switch is achieved by the thread itself making a function
call to the L-thread scheduler. Thus it is only necessary to preserve the
callee-save registers. The caller is responsible for saving any other
registers it is using before a function call, and restoring them on return,
and this is handled by the compiler. For ``X86_64`` on both Linux and BSD the
System V calling convention is used, which defines registers RSP, RBP, and
R12-R15 as callee-save registers (for a more detailed discussion a good reference
is `X86 Calling Conventions <https://en.wikipedia.org/wiki/X86_calling_conventions>`_).

Taking advantage of this, and due to the absence of preemption, an L-thread
context switch is achieved with less than 20 load/store instructions.

The scheduling policy for L-threads is fixed: there is no prioritization of
L-threads, all L-threads are equal, and scheduling is based on a FIFO
ready queue.

An L-thread is a struct containing the CPU context of the thread
(saved on context switch) and other useful items. The ready queue contains
pointers to threads that are ready to run. The L-thread scheduler is a simple
loop that polls the ready queue, reads from it the next thread ready to run,
which it resumes by saving the current context (the current position in the
scheduler loop) and restoring the context of the next thread from its thread
struct. Thus an L-thread is always resumed at the last place it yielded.

A well behaved L-thread will call the context switch regularly (at least once
in its main loop) thus returning to the scheduler's own main loop. Yielding
inserts the current thread at the back of the ready queue, and the process of
servicing the ready queue is repeated, thus the system runs by flipping back
and forth between the L-threads and the scheduler loop.
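
The structure described above (a thread struct holding a saved context, a FIFO
ready queue, a scheduler loop, and a yield that returns to the scheduler) can
be sketched in a few dozen lines, using ``ucontext(3)`` in place of the
subsystem's hand-written context-switch code. All names here are illustrative,
not the actual lthread API:

```c
/* Toy cooperative scheduler: a FIFO ready queue of thread structs, a
 * scheduler loop, and a yield that saves the current context and resumes
 * the scheduler.  ucontext(3) stands in for the real subsystem's
 * hand-written register save/restore. */
#include <stdlib.h>
#include <ucontext.h>

#define MAX_THREADS 8
#define STACK_SIZE  (64 * 1024)

struct toy_lthread {
    ucontext_t ctx;               /* CPU context saved on context switch */
};

static struct toy_lthread threads[MAX_THREADS];
static int ready_q[64];           /* FIFO ready queue of thread ids */
static int q_head, q_tail, n_threads;
static ucontext_t sched_ctx;      /* the scheduler loop's own context */
static int current;               /* id of the L-thread now running */

/* Yield: rejoin the back of the ready queue and resume the scheduler. */
static void toy_yield(void)
{
    ready_q[q_tail++] = current;
    swapcontext(&threads[current].ctx, &sched_ctx);
}

static void toy_create(void (*fn)(void))
{
    struct toy_lthread *lt = &threads[n_threads];

    getcontext(&lt->ctx);
    lt->ctx.uc_stack.ss_sp = malloc(STACK_SIZE);
    lt->ctx.uc_stack.ss_size = STACK_SIZE;
    lt->ctx.uc_link = &sched_ctx; /* return to the scheduler on exit */
    makecontext(&lt->ctx, fn, 0);
    ready_q[q_tail++] = n_threads++;
}

/* The scheduler: a simple loop that services the ready queue. */
static void toy_run(void)
{
    while (q_head < q_tail) {
        current = ready_q[q_head++];
        swapcontext(&sched_ctx, &threads[current].ctx);
    }
}

/* Two demo threads that interleave by yielding once each. */
static char trace[8];
static int trace_len;

static void thread_a(void) { trace[trace_len++] = 'a'; toy_yield(); trace[trace_len++] = 'a'; }
static void thread_b(void) { trace[trace_len++] = 'b'; toy_yield(); trace[trace_len++] = 'b'; }
```

Running ``thread_a`` and ``thread_b`` under ``toy_run()`` interleaves them in
FIFO order, and each thread is always resumed at the point where it last
yielded, exactly as described for L-threads above.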

In the case of pthreads, the preemptive scheduling, time slicing, and support
for thread prioritization means that progress is normally possible for any
thread that is ready to run. This comes at the price of a relatively heavier
context switch and scheduling overhead.

With L-threads the progress of any particular thread is determined by the
frequency of rescheduling opportunities in the other L-threads. This means that
an errant L-thread monopolizing the CPU might cause scheduling of other threads
to be stalled. Due to the lower cost of context switching, however, voluntary
rescheduling to ensure progress of other threads, if managed sensibly, is not
a prohibitive overhead, and overall performance can exceed that of an
application using pthreads.


Mutual exclusion
^^^^^^^^^^^^^^^^

With pthreads, preemption means that threads that share data must observe
some form of mutual exclusion protocol.

The fact that L-threads cannot preempt each other means that in many cases
mutual exclusion devices can be completely avoided.

Locking to protect shared data can be a significant bottleneck in
multi-threaded applications, so a carefully designed cooperatively scheduled
program can enjoy significant performance advantages.

So far we have considered only the simplistic case of a single core CPU;
when multiple CPUs are considered things are somewhat more complex.

First of all it is inevitable that there must be multiple L-thread schedulers,
one running on each EAL thread. So long as these schedulers remain isolated
from each other the above assertions about the potential advantages of
cooperative scheduling hold true.

A configuration with isolated cooperative schedulers is less flexible than the
pthread model where threads can be affinitized to run on any CPU. With isolated
schedulers, scaling of applications to utilize fewer or more CPUs according to
system demand is very difficult to achieve.

The L-thread subsystem makes it possible for L-threads to migrate between
schedulers running on different CPUs. Needless to say, if the migration means
that threads that share data end up running on different CPUs then this will
introduce the need for some kind of mutual exclusion system.

Of course ``rte_ring`` software rings can always be used to interconnect
threads running on different cores, however to protect other kinds of shared
data structures, lock free constructs or else explicit locking will be
required. This is a consideration for the application design.

In support of this extended functionality, the L-thread subsystem implements
thread safe mutexes and condition variables.

The cost of affinitizing and of condition variable signaling is significantly
lower than the equivalent pthread operations, and so applications using these
features will see a performance benefit.


Thread local storage
^^^^^^^^^^^^^^^^^^^^

As with applications written for pthreads, an application written for L-threads
can take advantage of thread local storage, in this case local to an L-thread.
An application may save and retrieve a single pointer to application data in
the L-thread struct.

For legacy and backward compatibility reasons two alternative methods are also
offered. The first is modeled directly on the pthread get/set specific APIs,
and the second is modeled on the ``RTE_PER_LCORE`` macros, whereby
``PER_LTHREAD`` macros are introduced. In both cases the storage is local to
the L-thread.
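
The get/set specific model can be sketched as a small per-thread table indexed
by key; this is a toy illustration only (the real subsystem provides
``lthread_key_create()``, ``lthread_setspecific()`` and
``lthread_getspecific()``, listed in the API table below):

```c
/* Sketch of pthread-style get/set specific storage scoped to a thread:
 * each thread holds a small table of pointers indexed by key.  Names
 * and limits are illustrative, not the actual lthread API. */
#include <stddef.h>

#define MAX_KEYS 16

struct toy_lthread {
    void *specific[MAX_KEYS];   /* per-thread slot for each key */
};

static unsigned int next_key;

/* Allocate a new key, shared by all threads; returns -1 when exhausted. */
static int toy_key_create(unsigned int *key)
{
    if (next_key >= MAX_KEYS)
        return -1;
    *key = next_key++;
    return 0;
}

static void toy_setspecific(struct toy_lthread *lt, unsigned int key, void *v)
{
    lt->specific[key] = v;
}

static void *toy_getspecific(struct toy_lthread *lt, unsigned int key)
{
    return lt->specific[key];
}
```

The same key yields a different value in each thread, which is the essential
property of thread local storage.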


.. _constraints_and_performance_implications:

Constraints and performance implications when using L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


.. _API_compatibility:

API compatibility
^^^^^^^^^^^^^^^^^

The L-thread subsystem provides a set of functions that are logically equivalent
to the corresponding functions offered by the POSIX pthread library, however not
all pthread functions have a corresponding L-thread equivalent, and not all
features available to pthreads are implemented for L-threads.

The pthread library offers considerable flexibility via programmable attributes
that can be associated with threads, mutexes, and condition variables.

By contrast the L-thread subsystem has fixed functionality: the scheduler policy
cannot be varied, and L-threads cannot be prioritized. There are no variable
attributes associated with any L-thread objects. L-threads, mutexes and
condition variables all have fixed functionality. (Note: reserved parameters
are included in the APIs to facilitate possible future support for attributes.)

The table below lists the pthread and equivalent L-thread APIs with notes on
differences and/or constraints. Where there is no L-thread entry in the table,
the L-thread subsystem provides no equivalent function.

.. _table_lthread_pthread:

.. table:: Pthread and equivalent L-thread APIs.

   +----------------------------+------------------------+-------------------+
   | **Pthread function**       | **L-thread function**  | **Notes**         |
   +============================+========================+===================+
   | pthread_barrier_destroy    |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_barrier_init       |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_barrier_wait       |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_broadcast     | lthread_cond_broadcast | See note 1        |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_destroy       | lthread_cond_destroy   |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_init          | lthread_cond_init      |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_signal        | lthread_cond_signal    | See note 1        |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_timedwait     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_wait          | lthread_cond_wait      | See note 5        |
   +----------------------------+------------------------+-------------------+
   | pthread_create             | lthread_create         | See notes 2, 3    |
   +----------------------------+------------------------+-------------------+
   | pthread_detach             | lthread_detach         | See note 4        |
   +----------------------------+------------------------+-------------------+
   | pthread_equal              |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_exit               | lthread_exit           |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getspecific        | lthread_getspecific    |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getcpuclockid      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_join               | lthread_join           |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_key_create         | lthread_key_create     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_key_delete         | lthread_key_delete     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_destroy      | lthread_mutex_destroy  |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_init         | lthread_mutex_init     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_lock         | lthread_mutex_lock     | See note 6        |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_trylock      | lthread_mutex_trylock  | See note 6        |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_timedlock    |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_unlock       | lthread_mutex_unlock   |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_once               |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_destroy     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_init        |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_rdlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_timedrdlock |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_timedwrlock |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_tryrdlock   |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_trywrlock   |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_unlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_wrlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_self               | lthread_current        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setspecific        | lthread_setspecific    |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_init          |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_destroy       |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_lock          |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_trylock       |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_unlock        |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_cancel             | lthread_cancel         |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setcancelstate     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setcanceltype      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_testcancel         |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getschedparam      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setschedparam      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_yield              | lthread_yield          | See note 7        |
   +----------------------------+------------------------+-------------------+
   | pthread_setaffinity_np     | lthread_set_affinity   | See notes 2, 3, 8 |
   +----------------------------+------------------------+-------------------+
642 | +----------------------------+------------------------+-------------------+ | |
643 | | | lthread_sleep | See note 9 | | |
644 | +----------------------------+------------------------+-------------------+ | |
645 | | | lthread_sleep_clks | See note 9 | | |
646 | +----------------------------+------------------------+-------------------+ | |
647 | ||
648 | ||
649 | **Note 1**: | |
650 | ||
651 | Neither lthread signal nor broadcast may be called concurrently by L-threads | |
652 | running on different schedulers, although multiple L-threads running in the | |
653 | same scheduler may freely perform signal or broadcast operations. L-threads | |
654 | running on the same or different schedulers may always safely wait on a | |
655 | condition variable. | |
656 | ||
657 | ||
658 | **Note 2**: | |
659 | ||
660 | Pthread attributes may be used to affinitize a pthread with a cpu-set. The | |
661 | L-thread subsystem does not support a cpu-set. An L-thread may be affinitized | |
662 | only with a single CPU at any time. | |
663 | ||
664 | ||
665 | **Note 3**: | |
666 | ||
If an L-thread is intended to run on a NUMA node other than the node on
which it is created, it is advantageous to specify the destination core as a
parameter of ``lthread_create()``. See
:ref:`memory_allocation_and_NUMA_awareness` for details.
671 | ||
672 | ||
673 | **Note 4**: | |
674 | ||
675 | An L-thread can only detach itself, and cannot detach other L-threads. | |
676 | ||
677 | ||
678 | **Note 5**: | |
679 | ||
A wait operation on a pthread condition variable is always associated with and
protected by a mutex, which must be owned by the thread at the time it invokes
``pthread_cond_wait()``. By contrast, L-thread condition variables are thread
safe (for waiters) and do not use an associated mutex. Multiple L-threads
(including L-threads running on other schedulers) can safely wait on an
L-thread condition variable. As a consequence, the performance of an L-thread
condition variable is typically an order of magnitude faster than that of its
pthread counterpart.
687 | ||
688 | ||
689 | **Note 6**: | |
690 | ||
Recursive locking is not supported with L-threads; attempts to take a lock
recursively will be detected and rejected.
693 | ||
694 | ||
695 | **Note 7**: | |
696 | ||
``lthread_yield()`` will save the current context, insert the current thread
at the back of the ready queue, and resume the next ready thread. Yielding
increases the ready queue backlog; see :ref:`ready_queue_backlog` for more
details about the implications of this.
701 | ||
702 | ||
N.B. The context switch time, as measured from immediately before the call to
``lthread_yield()`` to the point at which the next ready thread is resumed,
can be an order of magnitude faster than the same measurement for
``pthread_yield()``.
707 | ||
708 | ||
709 | **Note 8**: | |
710 | ||
``lthread_set_affinity()`` is similar to a yield, except that the yielding
thread is inserted into a peer ready queue of another scheduler. The peer
ready queue is a separate thread-safe queue, which means that threads
appearing in it can jump any backlog in the local ready queue on the
destination scheduler.
716 | ||
717 | The context switch time as measured from the time just before the call to | |
718 | ``lthread_set_affinity()`` to just after the same thread is resumed on the new | |
719 | scheduler can be orders of magnitude faster than the same measurement for | |
720 | ``pthread_setaffinity_np()``. | |
721 | ||
722 | ||
723 | **Note 9**: | |
724 | ||
725 | Although there is no ``pthread_sleep()`` function, ``lthread_sleep()`` and | |
726 | ``lthread_sleep_clks()`` can be used wherever ``sleep()``, ``usleep()`` or | |
727 | ``nanosleep()`` might ordinarily be used. The L-thread sleep functions suspend | |
728 | the current thread, start an ``rte_timer`` and resume the thread when the | |
729 | timer matures. The ``rte_timer_manage()`` entry point is called on every pass | |
730 | of the scheduler loop. This means that the worst case jitter on timer expiry | |
731 | is determined by the longest period between context switches of any running | |
732 | L-threads. | |
733 | ||
In a synthetic test with many threads sleeping and resuming, the measured
jitter is typically orders of magnitude lower than the same measurement made
for ``nanosleep()``.
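
Where a legacy thread would call ``usleep()``, the loop can instead suspend on
the L-thread timer. The following is a minimal sketch, assuming the
``lthread_sleep()`` prototype from ``lthread_api.h`` takes a duration in
nanoseconds; ``do_housekeeping()`` is a hypothetical work function:

.. code-block:: c

    #include <stdint.h>
    #include <lthread_api.h>

    static void do_housekeeping(void);     /* hypothetical work function */

    /* A low-frequency housekeeping L-thread: instead of usleep(), which
     * would stall the whole scheduler, suspend on an rte_timer via
     * lthread_sleep(). The thread leaves the ready queue until the
     * timer matures, so other L-threads keep running. */
    static void housekeeping_lthread(void *arg)
    {
        (void)arg;
        for (;;) {
            do_housekeeping();
            lthread_sleep(10000000ULL);   /* ~10 ms, expressed in ns */
        }
    }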
737 | ||
738 | ||
739 | **Note 10**: | |
740 | ||
Spin locks are not provided because they are problematic in a cooperative
environment; see :ref:`porting_locks_and_spinlocks` for a more detailed
discussion on how to avoid spin locks.
744 | ||
745 | ||
746 | .. _Thread_local_storage_performance: | |
747 | ||
748 | Thread local storage | |
749 | ^^^^^^^^^^^^^^^^^^^^ | |
750 | ||
751 | Of the three L-thread local storage options the simplest and most efficient is | |
752 | storing a single application data pointer in the L-thread struct. | |
753 | ||
The ``PER_LTHREAD`` macros involve a run-time computation to obtain the
address of the variable being saved/retrieved, and also require that accesses
are de-referenced via a pointer. This means that code using the
``RTE_PER_LCORE`` macros may need some slight adjustment when being ported to
L-threads (see :ref:`porting_thread_local_storage` for hints about porting
code that makes use of thread local storage).
760 | ||
761 | The get/set specific APIs are consistent with their pthread counterparts both | |
762 | in use and in performance. | |
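
As a sketch of the get/set specific pattern, assuming the lthread key APIs
mirror their pthread counterparts (``lthread_key_create()``,
``lthread_setspecific()``, ``lthread_getspecific()``); the stats structure
and key handling here are hypothetical:

.. code-block:: c

    #include <stdint.h>
    #include <stdlib.h>
    #include <lthread_api.h>

    struct my_stats { uint64_t events; };   /* hypothetical per-thread data */

    static unsigned int stats_key;          /* created once at start-up */

    static void tls_init(void)
    {
        /* destructor runs when the L-thread exits (assumed semantics) */
        lthread_key_create(&stats_key, free);
    }

    static void worker_lthread(void *arg)
    {
        struct my_stats *stats = calloc(1, sizeof(*stats));

        (void)arg;
        lthread_setspecific(stats_key, stats);

        /* ... later, anywhere in this L-thread ... */
        struct my_stats *s = lthread_getspecific(stats_key);
        s->events++;
    }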
763 | ||
764 | ||
765 | .. _memory_allocation_and_NUMA_awareness: | |
766 | ||
767 | Memory allocation and NUMA awareness | |
768 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
769 | ||
All memory allocation is from DPDK huge pages and is NUMA aware. Each
scheduler maintains its own caches of objects: L-threads, their stacks, TLS,
mutexes and condition variables. These caches are implemented as unbounded,
lock-free, MPSC queues. When objects are created they are always allocated
from the caches on the local core (current EAL thread).
775 | ||
776 | If an L-thread has been affinitized to a different scheduler, then it can | |
777 | always safely free resources to the caches from which they originated (because | |
778 | the caches are MPSC queues). | |
779 | ||
780 | If the L-thread has been affinitized to a different NUMA node then the memory | |
781 | resources associated with it may incur longer access latency. | |
782 | ||
The commonly used pattern of setting affinity on entry to a thread after it
has started means that memory allocation for both the stack and TLS will have
been made from caches on the NUMA node on which the thread's creator is
running. This has the side effect that access latency will be sub-optimal
after affinitizing.
788 | ||
This side effect can be mitigated to some extent (although not completely) by
specifying the destination CPU as a parameter of ``lthread_create()``. This
causes the L-thread's stack and TLS to be allocated when it is first scheduled
on the destination scheduler; if the destination is on another NUMA node this
results in a more optimal memory allocation.
794 | ||
Note that the lthread struct itself remains allocated from memory on the
creating node. This is unavoidable because an L-thread is known everywhere by
the address of this struct.
798 | ||
799 | ||
800 | .. _object_cache_sizing: | |
801 | ||
802 | Object cache sizing | |
803 | ^^^^^^^^^^^^^^^^^^^ | |
804 | ||
The per-lcore object caches pre-allocate objects in bulk whenever a request to
allocate an object finds a cache empty. By default 100 objects are
pre-allocated; this is defined by ``LTHREAD_PREALLOC`` in the public API
header file ``lthread_api.h``. This means that the caches constantly grow to
meet system demand.
810 | ||
811 | In the present implementation there is no mechanism to reduce the cache sizes | |
812 | if system demand reduces. Thus the caches will remain at their maximum extent | |
813 | indefinitely. | |
814 | ||
A consequence of the bulk pre-allocation of objects is that every 100 (the
default value) additional new object create operations result in a call to
``rte_malloc()``. For the creation of objects such as L-threads, which trigger
the allocation of even more objects (i.e. their stacks and TLS), this can
cause outliers in scheduling performance.
820 | ||
If this is a problem, the simplest mitigation strategy is to dimension the
system by setting the bulk object pre-allocation size to some large number
that you do not expect to be exceeded. This means the caches will be populated
once only, the very first time a thread is created.
825 | ||
826 | ||
827 | .. _Ready_queue_backlog: | |
828 | ||
829 | Ready queue backlog | |
830 | ^^^^^^^^^^^^^^^^^^^ | |
831 | ||
One of the more subtle performance considerations is managing the ready queue
backlog. The fewer threads that are waiting in the ready queue, the faster
any particular thread will get serviced.
835 | ||
836 | In a naive L-thread application with N L-threads simply looping and yielding, | |
837 | this backlog will always be equal to the number of L-threads, thus the cost of | |
838 | a yield to a particular L-thread will be N times the context switch time. | |
839 | ||
This side effect can be mitigated by arranging for threads to be suspended and
wait to be resumed, rather than polling for work by constantly yielding.
Blocking on a mutex or condition variable, or (even more obviously) having a
thread sleep if it has a low-frequency workload, are all mechanisms by which a
thread can be excluded from the ready queue until it really does need to be
run. This can have a significant positive impact on performance.
846 | ||
847 | ||
848 | .. _Initialization_and_shutdown_dependencies: | |
849 | ||
850 | Initialization, shutdown and dependencies | |
851 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
852 | ||
The L-thread subsystem depends on DPDK for huge page allocation and on the
``rte_timer`` subsystem. The DPDK EAL initialization and
``rte_timer_subsystem_init()`` **MUST** be completed before the L-thread
subsystem can be used.
857 | ||
Thereafter, initialization of the L-thread subsystem is largely transparent to
the application. Constructor functions ensure that global variables are
properly initialized. Apart from the global variables, each scheduler is
initialized independently the first time that an L-thread is created by a
particular EAL thread.
863 | ||
864 | If the schedulers are to be run as isolated and independent schedulers, with | |
865 | no intention that L-threads running on different schedulers will migrate between | |
866 | schedulers or synchronize with L-threads running on other schedulers, then | |
867 | initialization consists simply of creating an L-thread, and then running the | |
868 | L-thread scheduler. | |
869 | ||
870 | If there will be interaction between L-threads running on different schedulers, | |
871 | then it is important that the starting of schedulers on different EAL threads | |
872 | is synchronized. | |
873 | ||
To achieve this an additional initialization step is necessary: set the number
of schedulers by calling the API function ``lthread_num_schedulers_set(n)``,
where ``n`` is the number of EAL threads that will run L-thread schedulers.
Setting the number of schedulers to a number greater than 0 will cause all
schedulers to wait until the others have started before beginning to schedule
L-threads.
880 | ||
The L-thread scheduler is started by calling the function ``lthread_run()``.
It should be called from the EAL thread and thus becomes the main loop of
that EAL thread.
884 | ||
The function ``lthread_run()`` will not return until all threads running on
the scheduler have exited and the scheduler has been explicitly stopped by
calling ``lthread_scheduler_shutdown(lcore)`` or
``lthread_scheduler_shutdown_all()``.
889 | ||
All these functions do is tell the scheduler that it can exit when there are
no longer any running L-threads; neither function forces any running L-thread
to terminate. Any desired application shutdown behavior must be designed and
built into the application to ensure that L-threads complete in a timely
manner.
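
The start-up ordering described above can be sketched as follows. This is an
illustrative outline only, not the sample application's actual code; the
entry point names and the ``-1`` destination core (assumed to mean "current
core") are assumptions:

.. code-block:: c

    #include <rte_eal.h>
    #include <rte_lcore.h>
    #include <rte_launch.h>
    #include <rte_timer.h>
    #include <lthread_api.h>

    /* First L-thread on each scheduler; a real application would spawn
     * its worker L-threads here before eventually shutting down. */
    static void initial_lthread(void *arg)
    {
        (void)arg;
        /* ... create application L-threads, do work ... */
        lthread_scheduler_shutdown_all();  /* let every lthread_run() return */
    }

    /* Runs in each EAL thread; lthread_run() becomes its main loop. */
    static int sched_main(void *arg)
    {
        struct lthread *lt;

        /* Creating the first L-thread initializes this core's scheduler. */
        lthread_create(&lt, -1, initial_lthread, arg);
        lthread_run();
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)       /* MUST come first */
            return -1;
        rte_timer_subsystem_init();             /* needed for lthread_sleep() */

        /* Schedulers that will interact must start in step. */
        lthread_num_schedulers_set(rte_lcore_count());

        rte_eal_mp_remote_launch(sched_main, NULL, CALL_MASTER);
        rte_eal_mp_wait_lcore();
        return 0;
    }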
895 | ||
**Important Note:** It is assumed when the scheduler exits that the
application is terminating for good. The scheduler does not free resources
before exiting, and running the scheduler a subsequent time will result in
undefined behavior.
899 | ||
900 | ||
901 | .. _porting_legacy_code_to_run_on_lthreads: | |
902 | ||
903 | Porting legacy code to run on L-threads | |
904 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
905 | ||
Legacy code originally written for a pthread environment may be ported to
L-threads if the considerations about differences in scheduling policy and
the constraints discussed in the previous sections can be accommodated.
909 | ||
910 | This section looks in more detail at some of the issues that may have to be | |
911 | resolved when porting code. | |
912 | ||
913 | ||
914 | .. _pthread_API_compatibility: | |
915 | ||
916 | pthread API compatibility | |
917 | ^^^^^^^^^^^^^^^^^^^^^^^^^ | |
918 | ||
The first step is to establish exactly which pthread APIs the legacy
application uses, and to understand the requirements of those APIs. If there
are corresponding L-thread APIs, and the default pthread functionality is
used by the application, then, notwithstanding the other issues discussed
here, it should be feasible to run the application with L-threads. If the
legacy code modifies the default behavior using attributes, then it may be
necessary to make some adjustments to eliminate those requirements.
926 | ||
927 | ||
928 | .. _blocking_system_calls: | |
929 | ||
930 | Blocking system API calls | |
931 | ^^^^^^^^^^^^^^^^^^^^^^^^^ | |
932 | ||
It is important to understand what other system services the application may
be using, bearing in mind that in a cooperatively scheduled environment a
thread cannot block without stalling the scheduler, and with it all other
cooperative threads. Any kind of blocking system call, for example file or
socket IO, is a potential problem; a good tool to analyze the application for
this purpose is the ``strace`` utility.
939 | ||
There are many strategies to resolve these kinds of issues, each with its
merits. Possible solutions include:
942 | ||
943 | * Adopting a polled mode of the system API concerned (if available). | |
944 | ||
945 | * Arranging for another core to perform the function and synchronizing with | |
946 | that core via constructs that will not block the L-thread. | |
947 | ||
948 | * Affinitizing the thread to another scheduler devoted (as a matter of policy) | |
949 | to handling threads wishing to make blocking calls, and then back again when | |
950 | finished. | |
951 | ||
952 | ||
953 | .. _porting_locks_and_spinlocks: | |
954 | ||
955 | Locks and spinlocks | |
956 | ^^^^^^^^^^^^^^^^^^^ | |
957 | ||
958 | Locks and spinlocks are another source of blocking behavior that for the same | |
959 | reasons as system calls will need to be addressed. | |
960 | ||
If the application design ensures that the contending L-threads will always
run on the same scheduler, then it is probably safe to remove locks and spin
locks completely.
964 | ||
The only exception to the above rule is if for some reason the code performs
any kind of context switch whilst holding the lock (e.g. yield, sleep, or
block on a different lock or on a condition variable). This will need to be
determined before deciding to eliminate a lock.
969 | ||
970 | If a lock cannot be eliminated then an L-thread mutex can be substituted for | |
971 | either kind of lock. | |
972 | ||
973 | An L-thread blocking on an L-thread mutex will be suspended and will cause | |
974 | another ready L-thread to be resumed, thus not blocking the scheduler. When | |
975 | default behavior is required, it can be used as a direct replacement for a | |
976 | pthread mutex lock. | |
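
As a sketch of the substitution, assuming the L-thread mutex prototypes from
``lthread_api.h`` (a named mutex allocated by the init call, with lock and
unlock taking the mutex pointer); the shared counter here is hypothetical:

.. code-block:: c

    #include <stdint.h>
    #include <lthread_api.h>

    /* Substituting an L-thread mutex for a pthread mutex: blocking here
     * suspends only this L-thread, and the scheduler resumes another
     * ready L-thread instead of stalling. */
    static struct lthread_mutex *counter_lock;   /* hypothetical shared lock */
    static uint64_t counter;

    static void locks_init(void)
    {
        lthread_mutex_init("counter_lock", &counter_lock, NULL);
    }

    static void bump_counter(void)
    {
        lthread_mutex_lock(counter_lock);        /* may suspend this L-thread */
        counter++;
        lthread_mutex_unlock(counter_lock);
    }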
977 | ||
978 | Spin locks are typically used when lock contention is likely to be rare and | |
979 | where the period during which the lock may be held is relatively short. | |
980 | When the contending L-threads are running on the same scheduler then an | |
981 | L-thread blocking on a spin lock will enter an infinite loop stopping the | |
982 | scheduler completely (see :ref:`porting_infinite_loops` below). | |
983 | ||
984 | If the application design ensures that contending L-threads will always run | |
985 | on different schedulers then it might be reasonable to leave a short spin lock | |
986 | that rarely experiences contention in place. | |
987 | ||
If after all considerations it appears that a spin lock can neither be
eliminated completely, replaced with an L-thread mutex, nor left in place as
is, then an alternative is to loop on a flag, with a call to
``lthread_yield()`` inside the loop (n.b. if the contending L-threads might
ever run on different schedulers the flag will need to be manipulated
atomically).
994 | ||
995 | Spinning and yielding is the least preferred solution since it introduces | |
996 | ready queue backlog (see also :ref:`ready_queue_backlog`). | |
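
The flag-and-yield fallback can be sketched with the DPDK atomic helpers; the
flag and function names here are hypothetical:

.. code-block:: c

    #include <rte_atomic.h>
    #include <lthread_api.h>

    /* Least-preferred fallback for an irreplaceable spin lock: loop on a
     * flag and yield between attempts, so the scheduler keeps running.
     * The atomic manipulation is only required if contending L-threads
     * may run on different schedulers. */
    static rte_atomic32_t busy_flag;    /* zero-initialized: unlocked */

    static void enter_critical(void)
    {
        /* test-and-set; on failure give other L-threads a turn */
        while (!rte_atomic32_test_and_set(&busy_flag))
            lthread_yield();            /* note: adds ready queue backlog */
    }

    static void leave_critical(void)
    {
        rte_atomic32_clear(&busy_flag);
    }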
997 | ||
998 | ||
999 | .. _porting_sleeps_and_delays: | |
1000 | ||
1001 | Sleeps and delays | |
1002 | ^^^^^^^^^^^^^^^^^ | |
1003 | ||
Yet another kind of blocking behavior (albeit momentary) is that of delay
functions like ``sleep()``, ``usleep()``, ``nanosleep()`` etc. All will have
the consequence of stalling the L-thread scheduler, and unless the delay is
very short (e.g. a very short ``nanosleep()``) calls to these functions will
need to be eliminated.
1009 | ||
The simplest mitigation strategy is to use the L-thread sleep API functions,
of which two variants exist: ``lthread_sleep()`` and ``lthread_sleep_clks()``.
These functions start an ``rte_timer`` against the L-thread, suspend the
L-thread and cause another ready L-thread to be resumed. The suspended
L-thread is resumed when the ``rte_timer`` matures.
1015 | ||
1016 | ||
1017 | .. _porting_infinite_loops: | |
1018 | ||
1019 | Infinite loops | |
1020 | ^^^^^^^^^^^^^^ | |
1021 | ||
Some applications have threads with loops that contain no inherent
rescheduling opportunity, and rely solely on OS time slicing to share the
CPU. In a cooperative environment this will stop everything dead. These kinds
of loops are not hard to identify; in a debug session you will find that the
debugger is always stopping in the same loop.
1027 | ||
1028 | The simplest solution to this kind of problem is to insert an explicit | |
1029 | ``lthread_yield()`` or ``lthread_sleep()`` into the loop. Another solution | |
1030 | might be to include the function performed by the loop into the execution path | |
1031 | of some other loop that does in fact yield, if this is possible. | |
1032 | ||
1033 | ||
1034 | .. _porting_thread_local_storage: | |
1035 | ||
1036 | Thread local storage | |
1037 | ^^^^^^^^^^^^^^^^^^^^ | |
1038 | ||
1039 | If the application uses thread local storage, the use case should be | |
1040 | studied carefully. | |
1041 | ||
1042 | In a legacy pthread application either or both the ``__thread`` prefix, or the | |
1043 | pthread set/get specific APIs may have been used to define storage local to a | |
1044 | pthread. | |
1045 | ||
In some applications it may be a reasonable assumption that the data could,
or in fact most likely should, be placed in L-thread local storage.
1048 | ||
1049 | If the application (like many DPDK applications) has assumed a certain | |
1050 | relationship between a pthread and the CPU to which it is affinitized, there | |
1051 | is a risk that thread local storage may have been used to save some data items | |
1052 | that are correctly logically associated with the CPU, and others items which | |
1053 | relate to application context for the thread. Only a good understanding of the | |
1054 | application will reveal such cases. | |
1055 | ||
If the application requires that an L-thread be able to move between
schedulers, then care should be taken to separate these kinds of data into
per-lcore and per-L-thread storage. In this way a migrating thread will bring
with it the local data it needs, and pick up the new logical core specific
values from pthread local storage at its new home.
1061 | ||
1062 | ||
1063 | .. _pthread_shim: | |
1064 | ||
1065 | Pthread shim | |
1066 | ~~~~~~~~~~~~ | |
1067 | ||
1068 | A convenient way to get something working with legacy code can be to use a | |
1069 | shim that adapts pthread API calls to the corresponding L-thread ones. | |
This approach will not mitigate any of the porting considerations mentioned
in the previous sections, but it will reduce the amount of code churn that
would otherwise be involved. It is a reasonable approach for evaluating
L-threads before investing effort in porting to the native L-thread APIs.
1074 | ||
1075 | ||
1076 | Overview | |
1077 | ^^^^^^^^ | |
1078 | The L-thread subsystem includes an example pthread shim. This is a partial | |
1079 | implementation but does contain the API stubs needed to get basic applications | |
1080 | running. There is a simple "hello world" application that demonstrates the | |
1081 | use of the pthread shim. | |
1082 | ||
A subtlety of working with a shim is that the application will still need to
make use of the genuine pthread library functions, at the very least in order
to create the EAL threads in which the L-thread schedulers will run. This is
the case during DPDK initialization and exit.
1087 | ||
To deal with the initialization and shutdown scenarios, the shim is capable
of switching its adaptor functionality on or off; an application can control
this behavior by calling the function ``pt_override_set()``. The default
state is disabled.
1092 | ||
The pthread shim uses the dynamic linker loader and saves the loaded
addresses of the genuine pthread API functions in an internal table. When the
shim functionality is enabled it performs the adaptor function; when disabled
it invokes the genuine pthread function.
1097 | ||
The function ``pthread_exit()`` has additional special handling. The standard
system header file ``pthread.h`` declares ``pthread_exit()`` with
``__attribute__((noreturn))``. This is an optimization that is possible
because the pthread is terminating; it enables the compiler to omit the
normal handling of the stack and protection of registers, since the function
is not expected to return, and in fact the thread is being destroyed. These
optimizations are applied in both the callee and the caller of the
``pthread_exit()`` function.
1106 | ||
1107 | In our cooperative scheduling environment this behavior is inadmissible. The | |
1108 | pthread is the L-thread scheduler thread, and, although an L-thread is | |
1109 | terminating, there must be a return to the scheduler in order that the system | |
1110 | can continue to run. Further, returning from a function with attribute | |
1111 | ``noreturn`` is invalid and may result in undefined behavior. | |
1112 | ||
The solution is to redefine the ``pthread_exit`` function with a macro,
causing it to be mapped to a stub function in the shim that does not have the
``noreturn`` attribute. This macro is defined in the file
``pthread_shim.h``. The stub function is otherwise no different from any of
the other stub functions in the shim, and will switch between the real
``pthread_exit()`` function and the ``lthread_exit()`` function as required.
The only difference is the mapping to the stub by macro substitution.
1121 | ||
A consequence of this is that the file ``pthread_shim.h`` must be included in
legacy code wishing to make use of the shim. It also means that dynamic
linkage of a pre-compiled binary that did not include ``pthread_shim.h`` is
not supported.
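
The macro substitution technique can be illustrated in isolation. The
following self-contained sketch uses hypothetical names and is not the actual
shim code; it only demonstrates why remapping calls to a stub without the
``noreturn`` attribute allows a normal return to a scheduler loop:

.. code-block:: c

    #include <stddef.h>

    static int returned_normally;

    /* Stub WITHOUT __attribute__((noreturn)); the real shim stub would
     * dispatch to lthread_exit() when the shim is switched in, and to
     * the genuine pthread_exit() when it is switched out. */
    static void pthread_exit_stub(void *retval)
    {
        (void)retval;
        returned_normally = 1;     /* a normal return is now possible */
    }

    /* The pthread_shim.h-style remapping, applied to legacy code at
     * recompilation time. Note pthread.h must not be included after
     * this point with the real noreturn declaration in force. */
    #define pthread_exit(retval) pthread_exit_stub(retval)

    static void legacy_worker(void)
    {
        pthread_exit(NULL);        /* expands to the returning stub */
    }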
1126 | ||
Given the requirements for porting legacy code outlined in
:ref:`porting_legacy_code_to_run_on_lthreads`, most applications will require
at least some minimal adjustment and recompilation to run on L-threads, so
pre-compiled binaries are unlikely to be encountered in practice.
1131 | ||
1132 | In summary the shim approach adds some overhead but can be a useful tool to help | |
1133 | establish the feasibility of a code reuse project. It is also a fairly | |
1134 | straightforward task to extend the shim if necessary. | |
1135 | ||
**Note:** Bearing in mind the preceding discussions about the impact of
making blocking calls, switching the shim in and out on the fly to invoke a
pthread API that might block is something that should typically be avoided.
1139 | ||
1140 | ||
1141 | Building and running the pthread shim | |
1142 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
1143 | ||
The shim example application is located in the ``performance-thread`` folder
of the sample applications.

To build and run the pthread shim example:
1148 | ||
1149 | #. Go to the example applications folder | |
1150 | ||
1151 | .. code-block:: console | |
1152 | ||
1153 | export RTE_SDK=/path/to/rte_sdk | |
1154 | cd ${RTE_SDK}/examples/performance-thread/pthread_shim | |
1155 | ||
1156 | ||
1157 | #. Set the target (a default target is used if not specified). For example: | |
1158 | ||
1159 | .. code-block:: console | |
1160 | ||
9f95a23c | 1161 | export RTE_TARGET=x86_64-native-linux-gcc |
7c673cae FG |
1162 | |
1163 | See the DPDK Getting Started Guide for possible RTE_TARGET values. | |
1164 | ||
1165 | #. Build the application: | |
1166 | ||
1167 | .. code-block:: console | |
1168 | ||
1169 | make | |
1170 | ||
1171 | #. To run the pthread_shim example | |
1172 | ||
1173 | .. code-block:: console | |
1174 | ||
1175 | lthread-pthread-shim -c core_mask -n number_of_channels | |
1176 | ||
1177 | .. _lthread_diagnostics: | |
1178 | ||
1179 | L-thread Diagnostics | |
1180 | ~~~~~~~~~~~~~~~~~~~~ | |
1181 | ||
1182 | When debugging you must take account of the fact that the L-threads are run in | |
1183 | a single pthread. The current scheduler is defined by | |
1184 | ``RTE_PER_LCORE(this_sched)``, and the current lthread is stored at | |
1185 | ``RTE_PER_LCORE(this_sched)->current_lthread``. Thus on a breakpoint in a GDB | |
1186 | session the current lthread can be obtained by displaying the pthread local | |
1187 | variable ``per_lcore_this_sched->current_lthread``. | |
1188 | ||
Another useful diagnostic feature is the ability to trace significant events
in the life of an L-thread. This feature is enabled by changing the value of
``LTHREAD_DIAG`` from 0 to 1 in the file ``lthread_diag_api.h``.
1192 | ||
Tracing of events can be individually masked, and the mask may be programmed
at run time. An unmasked event results in a callback that provides
information about the event. The default callback simply prints trace
information. The default mask is 0 (all events off); the mask can be modified
by calling the function ``lthread_diagnostic_set_mask()``.
1198 | ||
It is possible to register a user callback function to implement more
sophisticated diagnostic functions.
Object creation events (lthread, mutex, and condition variable) accept, and
store in the created object, a user supplied reference value returned by the
callback function.
1204 | ||
The lthread reference value is passed back in all subsequent event callbacks,
and APIs are provided to retrieve the reference value from mutexes and
condition variables. This enables a user to monitor, count, or filter for
specific events on specific objects, for example to monitor for a specific
thread signaling a specific condition variable, or to monitor all timer
events; the possibilities and combinations are endless.
1211 | ||
1212 | The callback function can be set by calling the function | |
1213 | ``lthread_diagnostic_enable()`` supplying a callback function pointer and an | |
1214 | event mask. | |
1215 | ||
1216 | Setting ``LTHREAD_DIAG`` also enables counting of statistics about cache and | |
1217 | queue usage, and these statistics can be displayed by calling the function | |
1218 | ``lthread_diag_stats_display()``. This function also performs a consistency | |
1219 | check on the caches and queues. The function should only be called from the | |
1220 | master EAL thread after all slave threads have stopped and returned to the C | |
1221 | main program, otherwise the consistency check will fail. |