.. SPDX-License-Identifier: BSD-3-Clause
   Copyright(c) 2015 Intel Corporation.

Performance Thread Sample Application
=====================================

The performance thread sample application is a derivative of the standard L3
forwarding application that demonstrates different threading models.

Overview
--------

For a general description of the L3 forwarding application's capabilities
please refer to the documentation of the standard application in
:doc:`l3_forward`.

The performance thread sample application differs from the standard L3
forwarding example in that it divides the TX and RX processing between
different threads, and makes it possible to assign individual threads to
different cores.

Three threading models are considered:

#. When there is one EAL thread per physical core.
#. When there are multiple EAL threads per physical core.
#. When there are multiple lightweight threads per EAL thread.

Since DPDK release 2.0 it is possible to launch applications using the
``--lcores`` EAL parameter, specifying cpu-sets for a physical core. With the
performance thread sample application it is now also possible to assign
individual RX and TX functions to different cores.

As an alternative to dividing the L3 forwarding work between different EAL
threads, the performance thread sample introduces the possibility to run the
application threads as lightweight threads (L-threads) within one or
more EAL threads.

In order to facilitate this threading model the example includes a primitive
cooperative scheduler (L-thread) subsystem. More details of the L-thread
subsystem can be found in :ref:`lthread_subsystem`.

**Note:** Whilst theoretically possible, running multiple L-thread schedulers
on the same physical core is not anticipated; this mode of operation should
not be expected to yield useful performance and is considered invalid.

Compiling the Application
-------------------------

To compile the sample application see :doc:`compiling`.

The application is located in the ``performance-thread/l3fwd-thread``
sub-directory.

Running the Application
-----------------------

The application has a number of command line options::

    ./build/l3fwd-thread [EAL options] --
        -p PORTMASK [-P]
        --rx(port,queue,lcore,thread)[,(port,queue,lcore,thread)]
        --tx(lcore,thread)[,(lcore,thread)]
        [--enable-jumbo] [--max-pkt-len PKTLEN] [--no-numa]
        [--hash-entry-num] [--ipv6] [--no-lthreads] [--stat-lcore lcore]
        [--parse-ptype]

Where:

* ``-p PORTMASK``: Hexadecimal bitmask of ports to configure.

* ``-P``: optional, sets all ports to promiscuous mode so that packets are
  accepted regardless of the packet's Ethernet MAC destination address.
  Without this option, only packets with the Ethernet MAC destination address
  set to the Ethernet address of the port are accepted.

* ``--rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]``: the list of
  NIC RX ports and queues handled by the RX lcores and threads. The parameters
  are explained below.

* ``--tx (lcore,thread)[,(lcore,thread)]``: the list of TX threads identifying
  the lcore the thread runs on, and the id of the RX thread with which it is
  associated. The parameters are explained below.

* ``--enable-jumbo``: optional, enables jumbo frames.

* ``--max-pkt-len``: optional, maximum packet length in decimal (64-9600).

* ``--no-numa``: optional, disables NUMA awareness.

* ``--hash-entry-num``: optional, specifies the hash entry number in hex to be
  setup.

* ``--ipv6``: optional, set if running IPv6 packets.

* ``--no-lthreads``: optional, disables the L-thread model and uses the EAL
  threading model. See below.

* ``--stat-lcore``: optional, run the CPU load stats collector on the
  specified lcore.

* ``--parse-ptype``: optional, set to use software to analyze packet type.
  Without this option, hardware will check the packet type.

The parameters of the ``--rx`` and ``--tx`` options are:

* ``--rx`` parameters

  .. _table_l3fwd_rx_parameters:

  +--------+------------------------------------------------------+
  | port   | RX port                                              |
  +--------+------------------------------------------------------+
  | queue  | RX queue that will be read on the specified RX port  |
  +--------+------------------------------------------------------+
  | lcore  | Core to use for the thread                           |
  +--------+------------------------------------------------------+
  | thread | Thread id (continuously from 0 to N)                 |
  +--------+------------------------------------------------------+


* ``--tx`` parameters

  .. _table_l3fwd_tx_parameters:

  +--------+------------------------------------------------------+
  | lcore  | Core to use for L3 route match and transmit          |
  +--------+------------------------------------------------------+
  | thread | Id of RX thread to be associated with this TX thread |
  +--------+------------------------------------------------------+

The ``l3fwd-thread`` application allows you to start packet processing in two
threading models: L-Threads (default) and EAL Threads (when the
``--no-lthreads`` parameter is used). For consistency all parameters are used
in the same way for both models.


Running with L-threads
~~~~~~~~~~~~~~~~~~~~~~

When the L-thread model is used (default option), lcore and thread parameters
in ``--rx/--tx`` are used to affinitize threads to the selected scheduler.

For example, the following places every l-thread on different lcores::

    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,1,1)" \
                --tx="(2,0)(3,1)"

The following places the RX l-threads on lcore 0 and the TX l-threads on
lcores 1 and 2::

    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,0,1)" \
                --tx="(1,0)(2,1)"

Running with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~

When the ``--no-lthreads`` parameter is used, the L-threading model is turned
off and EAL threads are used for all processing. EAL threads are enumerated in
the same way as L-threads, but the ``--lcores`` EAL parameter is used to
affinitize threads to the selected cpu-set (scheduler). Thus it is possible to
place every RX and TX thread on different lcores.

For example, the following places every EAL thread on different lcores::

    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,1,1)" \
                --tx="(2,0)(3,1)" \
                --no-lthreads


To affinitize two or more EAL threads to one cpu-set, the EAL ``--lcores``
parameter is used.

The following places the RX EAL threads on physical core 0 and the TX EAL
threads on physical core 1::

    l3fwd-thread -l 0-7 -n 2 --lcores="(0,1)@0,(2,3)@1" -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,1,1)" \
                --tx="(2,0)(3,1)" \
                --no-lthreads


Examples
~~~~~~~~

For selected scenarios, the L-thread command line configuration of the
application and its EAL thread equivalent can be realized as follows:

a) Start every thread on a different scheduler (1:1)::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,1,1)" \
                   --tx="(2,0)(3,1)"

   EAL thread equivalent::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,1,1)" \
                   --tx="(2,0)(3,1)" \
                   --no-lthreads

b) Start all threads on one core (N:1).

   Start 4 L-threads on lcore 0::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,0,1)" \
                   --tx="(0,0)(0,1)"

   Start 4 EAL threads on cpu-set 0::

       l3fwd-thread -l 0-7 -n 2 --lcores="(0-3)@0" -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,0,1)" \
                   --tx="(2,0)(3,1)" \
                   --no-lthreads

c) Start threads on different cores (N:M).

   Start 2 L-threads for RX on lcore 0, and 2 L-threads for TX on lcore 1::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,0,1)" \
                   --tx="(1,0)(1,1)"

   Start 2 EAL threads for RX on cpu-set 0, and 2 EAL threads for TX on
   cpu-set 1::

       l3fwd-thread -l 0-7 -n 2 --lcores="(0-1)@0,(2-3)@1" -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,1,1)" \
                   --tx="(2,0)(3,1)" \
                   --no-lthreads

Explanation
-----------

To a great extent the sample application differs little from the standard L3
forwarding application, and readers are advised to familiarize themselves with
the material covered in the :doc:`l3_forward` documentation before proceeding.

The following explanation is focused on the way threading is handled in the
performance thread example.


Mode of operation with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance thread sample application has split the RX and TX
functionality into two different threads, and the RX and TX threads are
interconnected via software rings. With respect to these rings the RX threads
are producers and the TX threads are consumers.

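The producer/consumer relationship can be pictured with a minimal
single-producer single-consumer ring in plain C. This is an illustrative
stand-in for ``rte_ring``, not the DPDK implementation; the names and the
ring size are invented for the sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Toy SPSC ring connecting one RX thread (producer) to one TX thread
 * (consumer). Size is a power of two so masking wraps the indices. */
#define RING_SIZE 8              /* illustrative; real rings are larger */

struct sw_ring {
    void *slots[RING_SIZE];
    size_t head;                 /* next slot the producer writes */
    size_t tail;                 /* next slot the consumer reads  */
};

/* Returns 1 on success, 0 if the ring is full (RX must retry or drop). */
static int ring_enqueue(struct sw_ring *r, void *pkt)
{
    if (r->head - r->tail == RING_SIZE)
        return 0;
    r->slots[r->head & (RING_SIZE - 1)] = pkt;
    r->head++;
    return 1;
}

/* Returns the oldest packet, or NULL if the ring is empty (TX idles). */
static void *ring_dequeue(struct sw_ring *r)
{
    if (r->head == r->tail)
        return NULL;
    void *pkt = r->slots[r->tail & (RING_SIZE - 1)];
    r->tail++;
    return pkt;
}
```

The real rings additionally support burst enqueue/dequeue and concurrent
access variants; only the FIFO producer/consumer shape is shown here.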
On initialization the TX and RX threads are started according to the command
line parameters.

The RX threads poll the network interface queues and post received packets to
a TX thread via a corresponding software ring.

The TX threads poll software rings, perform the L3 forwarding hash/LPM match,
and assemble packet bursts before performing burst transmit on the network
interface.

As with the standard L3 forwarding application, burst draining of residual
packets is performed periodically with the period calculated from elapsed time
using the timestamps counter.

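The drain decision described above can be sketched as a pure function, with
the timestamp-counter read replaced by an explicit cycle argument. The burst
size and drain period constants here are invented for illustration, not the
application's actual values:

```c
#include <stdint.h>

#define MAX_PKT_BURST 32    /* illustrative burst size          */
#define DRAIN_PERIOD  100   /* illustrative period in "cycles"  */

/* Returns 1 if the TX buffer holding `buffered` packets should be
 * flushed at time `now_cycles`: either a full burst is ready, or
 * residual packets have waited longer than the drain period since
 * the last drain recorded in `*last_drain`. */
static int should_flush(uint64_t now_cycles, uint64_t *last_drain,
                        unsigned buffered)
{
    if (buffered >= MAX_PKT_BURST)
        return 1;                             /* full burst: send now */
    if (buffered > 0 && now_cycles - *last_drain > DRAIN_PERIOD) {
        *last_drain = now_cycles;             /* residuals waited long enough */
        return 1;
    }
    return 0;
}
```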
The diagram below illustrates a case with two RX threads and three TX threads.

.. _figure_performance_thread_1:

.. figure:: img/performance_thread_1.*


Mode of operation with L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Like the EAL thread configuration the application has split the RX and TX
functionality into different threads, and the pairs of RX and TX threads are
interconnected via software rings.

On initialization an L-thread scheduler is started on every EAL thread. On all
but the master EAL thread only a dummy L-thread is initially started.
The L-thread started on the master EAL thread then spawns other L-threads on
different L-thread schedulers according to the command line parameters.

The RX threads poll the network interface queues and post received packets
to a TX thread via the corresponding software ring.

The ring interface is augmented by means of an L-thread condition variable
that enables the TX thread to be suspended when the TX ring is empty. The RX
thread signals the condition whenever it posts to the TX ring, causing the TX
thread to be resumed.

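The suspend/resume protocol has the shape below, sketched here with POSIX
primitives standing in for their L-thread counterparts
(``lthread_cond_wait``/``lthread_cond_signal``). The struct and function
names are invented; note that unlike this POSIX stand-in, the L-thread
condition variables do not require an associated mutex:

```c
#include <pthread.h>

/* One ring/condition-variable pair, as between an RX and a TX thread. */
struct tx_ring {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int             pending;   /* packets waiting in the ring */
};

/* RX side: post work, then signal so a suspended TX thread resumes. */
static void rx_post(struct tx_ring *r, int nb_pkts)
{
    pthread_mutex_lock(&r->lock);
    r->pending += nb_pkts;
    pthread_cond_signal(&r->nonempty);
    pthread_mutex_unlock(&r->lock);
}

/* TX side: suspend while the ring is empty instead of spin-polling. */
static int tx_take(struct tx_ring *r)
{
    pthread_mutex_lock(&r->lock);
    while (r->pending == 0)
        pthread_cond_wait(&r->nonempty, &r->lock);
    int n = r->pending;
    r->pending = 0;
    pthread_mutex_unlock(&r->lock);
    return n;
}
```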
Additionally the TX L-thread spawns a worker L-thread to take care of
polling the software rings, whilst it handles burst draining of the transmit
buffer.

The worker threads poll the software rings, perform L3 route lookup and
assemble packet bursts. If the TX ring is empty the worker thread suspends
itself by waiting on the condition variable associated with the ring.

Burst draining of residual packets, less than the burst size, is performed by
the TX thread which sleeps (using an L-thread sleep function) and resumes
periodically to flush the TX buffer.

This design means that L-threads that have no work can yield the CPU to other
L-threads and avoid having to constantly poll the software rings.

The diagram below illustrates a case with two RX threads and three TX
functions (each comprising a thread that processes forwarding and a thread
that periodically drains the output buffer of residual packets).

.. _figure_performance_thread_2:

.. figure:: img/performance_thread_2.*


CPU load statistics
~~~~~~~~~~~~~~~~~~~

It is possible to display statistics showing estimated CPU load on each core.
The statistics indicate the percentage of CPU time spent: processing
received packets (forwarding), polling queues/rings (waiting for work),
and doing any other processing (context switch and other overhead).

When enabled, statistics are gathered by having the application threads set
and clear flags when they enter and exit pertinent code sections. The flags
are then sampled in real time by a statistics collector thread running on
another core. This thread displays the data in real time on the console.

This feature is enabled by designating a statistics collector core, using the
``--stat-lcore`` parameter.


.. _lthread_subsystem:

The L-thread subsystem
----------------------

The L-thread subsystem resides in the ``examples/performance-thread/common``
directory and is built and linked automatically when building the
``l3fwd-thread`` example.

The subsystem provides a simple cooperative scheduler to enable arbitrary
functions to run as cooperative threads within a single EAL thread.
The subsystem provides a pthread-like API that is intended to assist in the
reuse of legacy code written for POSIX pthreads.

The following sections provide some detail on the features, constraints,
performance and porting considerations when using L-threads.

.. _comparison_between_lthreads_and_pthreads:

Comparison between L-threads and POSIX pthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The fundamental difference between the L-thread and pthread models is the
way in which threads are scheduled. The simplest way to think about this is to
consider the case of a processor with a single CPU. To run multiple threads
on a single CPU, the scheduler must frequently switch between the threads,
in order that each thread is able to make timely progress.
This is the basis of any multitasking operating system.

This section explores the differences between the pthread model and the
L-thread model as implemented in the provided L-thread subsystem. If needed, a
theoretical discussion of preemptive vs cooperative multi-threading can be
found in any good text on operating system design.


Scheduling and context switching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The POSIX pthread library provides an application programming interface to
create and synchronize threads. Scheduling policy is determined by the host
OS, and may be configurable. The OS may use sophisticated rules to determine
which thread should be run next, threads may suspend themselves or make other
threads ready, and the scheduler may employ a time slice giving each thread a
maximum time quantum after which it will be preempted in favor of another
thread that is ready to run. To complicate matters further, threads may be
assigned different scheduling priorities.

By contrast the L-thread subsystem is considerably simpler. Logically the
L-thread scheduler performs the same multiplexing function for L-threads
within a single pthread as the OS scheduler does for pthreads within an
application process. The L-thread scheduler is simply the main loop of a
pthread, and in so far as the host OS is concerned it is a regular pthread
just like any other. The host OS is oblivious to the existence of L-threads
and is not at all involved in their scheduling.

The other and most significant difference between the two models is that
L-threads are scheduled cooperatively. L-threads cannot preempt each
other, nor can the L-thread scheduler preempt a running L-thread (i.e.
there is no time slicing). The consequence is that programs implemented with
L-threads must possess frequent rescheduling points, meaning that they must
explicitly and of their own volition return to the scheduler at frequent
intervals, in order to allow other L-threads an opportunity to proceed.

In both models switching between threads requires that the current CPU
context is saved and a new context (belonging to the next thread ready to run)
is restored. With pthreads this context switching is handled transparently
and the set of CPU registers that must be preserved between context switches
is as per an interrupt handler.

An L-thread context switch is achieved by the thread itself making a function
call to the L-thread scheduler. Thus it is only necessary to preserve the
callee-saved registers. The caller is responsible for saving and restoring any
other registers it is using before a function call, and restoring them on
return, and this is handled by the compiler. For ``X86_64`` on both Linux and
BSD the System V calling convention is used; this defines the registers RSP,
RBP, and R12-R15 as callee-save registers (for a more detailed discussion a
good reference is
`X86 Calling Conventions <https://en.wikipedia.org/wiki/X86_calling_conventions>`_).

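To make the callee-save set concrete: a hypothetical context record for this
convention needs only six general-purpose slots, since the return address and
caller-saved state live on the thread's own stack. This struct is an
illustration of the register set, not the sample's actual context layout:

```c
#include <stdint.h>

/* Hypothetical x86-64 L-thread context: only the System V callee-saved
 * registers need to be preserved across a cooperative (function-call)
 * switch; RSP implicitly carries the return address on the stack. */
struct lthread_ctx {
    uint64_t rsp;   /* stack pointer                 */
    uint64_t rbp;   /* frame pointer                 */
    uint64_t r12;
    uint64_t r13;   /* remaining callee-save         */
    uint64_t r14;   /* general-purpose registers     */
    uint64_t r15;
};
```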
Taking advantage of this, and due to the absence of preemption, an L-thread
context switch is achieved with less than 20 load/store instructions.

The scheduling policy for L-threads is fixed: there is no prioritization of
L-threads, all L-threads are equal, and scheduling is based on a FIFO
ready queue.

An L-thread is a struct containing the CPU context of the thread
(saved on context switch) and other useful items. The ready queue contains
pointers to threads that are ready to run. The L-thread scheduler is a simple
loop that polls the ready queue, reads from it the next thread ready to run,
and resumes it by saving the current context (the current position in the
scheduler loop) and restoring the context of the next thread from its thread
struct. Thus an L-thread is always resumed at the last place it yielded.

A well behaved L-thread will call the context switch regularly (at least once
in its main loop), thus returning to the scheduler's own main loop. Yielding
inserts the current thread at the back of the ready queue, and the process of
servicing the ready queue is repeated; thus the system runs by flipping back
and forth between L-threads and the scheduler loop.

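The yield/ready-queue cycle above can be modeled with a toy FIFO queue in
which integer ids stand in for L-thread structs. All names are invented for
the sketch, and the real scheduler additionally saves and restores CPU
context at each step:

```c
#include <assert.h>

#define QCAP 64    /* illustrative queue capacity */

struct ready_queue {
    int items[QCAP];
    unsigned head, tail;     /* head: next to resume; tail: next free */
};

static void rq_push(struct ready_queue *q, int lt)
{
    q->items[q->tail++ % QCAP] = lt;     /* yield: go to back of queue */
}

static int rq_pop(struct ready_queue *q)
{
    return q->items[q->head++ % QCAP];   /* scheduler: take the front  */
}

/* Run `steps` scheduling decisions over threads 0..n-1, recording the
 * resume order: with every thread yielding, this is pure round-robin. */
static void schedule(int n, int steps, int order[])
{
    struct ready_queue q = { {0}, 0, 0 };
    for (int i = 0; i < n; i++)
        rq_push(&q, i);                  /* all threads start ready */
    for (int s = 0; s < steps; s++) {
        int lt = rq_pop(&q);             /* restore this thread's context */
        order[s] = lt;                   /* ...thread runs, then yields... */
        rq_push(&q, lt);                 /* back of the ready queue */
    }
}
```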
In the case of pthreads, the preemptive scheduling, time slicing, and support
for thread prioritization means that progress is normally possible for any
thread that is ready to run. This comes at the price of a relatively heavier
context switch and scheduling overhead.

With L-threads the progress of any particular thread is determined by the
frequency of rescheduling opportunities in the other L-threads. This means
that an errant L-thread monopolizing the CPU might cause scheduling of other
threads to be stalled. Due to the lower cost of context switching, however,
voluntary rescheduling to ensure progress of other threads, if managed
sensibly, is not a prohibitive overhead, and overall performance can exceed
that of an application using pthreads.


Mutual exclusion
^^^^^^^^^^^^^^^^

With pthreads, preemption means that threads that share data must observe
some form of mutual exclusion protocol.

The fact that L-threads cannot preempt each other means that in many cases
mutual exclusion devices can be completely avoided.

Locking to protect shared data can be a significant bottleneck in
multi-threaded applications, so a carefully designed cooperatively scheduled
program can enjoy significant performance advantages.

So far we have considered only the simplistic case of a single core CPU;
when multiple CPUs are considered things are somewhat more complex.

First of all it is inevitable that there must be multiple L-thread schedulers,
one running on each EAL thread. So long as these schedulers remain isolated
from each other the above assertions about the potential advantages of
cooperative scheduling hold true.

A configuration with isolated cooperative schedulers is less flexible than the
pthread model where threads can be affinitized to run on any CPU. With
isolated schedulers, scaling of applications to utilize fewer or more CPUs
according to system demand is very difficult to achieve.

The L-thread subsystem makes it possible for L-threads to migrate between
schedulers running on different CPUs. Needless to say, if the migration means
that threads that share data end up running on different CPUs then this will
introduce the need for some kind of mutual exclusion system.

Of course ``rte_ring`` software rings can always be used to interconnect
threads running on different cores, however to protect other kinds of shared
data structures, lock-free constructs or else explicit locking will be
required. This is a consideration for the application design.

In support of this extended functionality, the L-thread subsystem implements
thread-safe mutexes and condition variables.

The cost of affinitizing and of condition variable signaling is significantly
lower than the equivalent pthread operations, and so applications using these
features will see a performance benefit.


Thread local storage
^^^^^^^^^^^^^^^^^^^^

As with applications written for pthreads, an application written for
L-threads can take advantage of thread local storage, in this case local to an
L-thread. An application may save and retrieve a single pointer to application
data in the L-thread struct.

For legacy and backward compatibility reasons two alternative methods are also
offered: the first is modeled directly on the pthread get/set specific APIs,
the second approach is modeled on the ``RTE_PER_LCORE`` macros, whereby
``PER_LTHREAD`` macros are introduced. In both cases the storage is local to
the L-thread.


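The single-pointer mechanism reduces to one slot in the L-thread struct that
the currently running thread's accessors read and write. The following sketch
uses invented names throughout; it shows the shape of the mechanism, not the
subsystem's actual definitions:

```c
#include <stddef.h>

/* Toy model: per-thread data is a single pointer slot in the L-thread
 * struct. All names here are invented for illustration. */
struct toy_lthread {
    void *per_thread_data;       /* the single application pointer */
    /* ... CPU context, state, ready-queue linkage ... */
};

/* The thread the (single-core) scheduler is currently running. */
static struct toy_lthread *toy_current;

static void toy_set_data(void *data)
{
    toy_current->per_thread_data = data;
}

static void *toy_get_data(void)
{
    return toy_current->per_thread_data;
}
```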
.. _constraints_and_performance_implications:

Constraints and performance implications when using L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


.. _API_compatibility:

API compatibility
^^^^^^^^^^^^^^^^^

The L-thread subsystem provides a set of functions that are logically
equivalent to the corresponding functions offered by the POSIX pthread
library; however not all pthread functions have a corresponding L-thread
equivalent, and not all features available to pthreads are implemented for
L-threads.

The pthread library offers considerable flexibility via programmable
attributes that can be associated with threads, mutexes, and condition
variables.

By contrast the L-thread subsystem has fixed functionality: the scheduler
policy cannot be varied, and L-threads cannot be prioritized. There are no
variable attributes associated with any L-thread objects. L-threads, mutexes
and condition variables all have fixed functionality. (Note: reserved
parameters are included in the APIs to facilitate possible future support for
attributes.)

The table below lists the pthread and equivalent L-thread APIs with notes on
differences and/or constraints. Where there is no L-thread entry in the table,
the L-thread subsystem provides no equivalent function.

.. _table_lthread_pthread:

.. table:: Pthread and equivalent L-thread APIs.

   +----------------------------+------------------------+-------------------+
   | **Pthread function**       | **L-thread function**  | **Notes**         |
   +============================+========================+===================+
   | pthread_barrier_destroy    |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_barrier_init       |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_barrier_wait       |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_broadcast     | lthread_cond_broadcast | See note 1        |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_destroy       | lthread_cond_destroy   |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_init          | lthread_cond_init      |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_signal        | lthread_cond_signal    | See note 1        |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_timedwait     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_wait          | lthread_cond_wait      | See note 5        |
   +----------------------------+------------------------+-------------------+
   | pthread_create             | lthread_create         | See notes 2, 3    |
   +----------------------------+------------------------+-------------------+
   | pthread_detach             | lthread_detach         | See note 4        |
   +----------------------------+------------------------+-------------------+
   | pthread_equal              |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_exit               | lthread_exit           |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getspecific        | lthread_getspecific    |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getcpuclockid      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_join               | lthread_join           |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_key_create         | lthread_key_create     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_key_delete         | lthread_key_delete     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_destroy      | lthread_mutex_destroy  |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_init         | lthread_mutex_init     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_lock         | lthread_mutex_lock     | See note 6        |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_trylock      | lthread_mutex_trylock  | See note 6        |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_timedlock    |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_unlock       | lthread_mutex_unlock   |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_once               |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_destroy     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_init        |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_rdlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_timedrdlock |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_timedwrlock |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_tryrdlock   |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_trywrlock   |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_unlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_wrlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_self               | lthread_current        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setspecific        | lthread_setspecific    |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_init          |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_destroy       |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_lock          |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_trylock       |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_unlock        |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_cancel             | lthread_cancel         |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setcancelstate     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setcanceltype      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_testcancel         |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getschedparam      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setschedparam      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_yield              | lthread_yield          | See note 7        |
   +----------------------------+------------------------+-------------------+
   | pthread_setaffinity_np     | lthread_set_affinity   | See notes 2, 3, 8 |
   +----------------------------+------------------------+-------------------+
   |                            | lthread_sleep          | See note 9        |
   +----------------------------+------------------------+-------------------+
   |                            | lthread_sleep_clks     | See note 9        |
   +----------------------------+------------------------+-------------------+


**Note 1**:

Neither lthread signal nor broadcast may be called concurrently by L-threads
running on different schedulers, although multiple L-threads running in the
same scheduler may freely perform signal or broadcast operations. L-threads
running on the same or different schedulers may always safely wait on a
condition variable.


**Note 2**:

Pthread attributes may be used to affinitize a pthread with a cpu-set. The
L-thread subsystem does not support a cpu-set. An L-thread may be affinitized
only with a single CPU at any time.


**Note 3**:

If an L-thread is intended to run on a different NUMA node than the node that
creates the thread, then when calling ``lthread_create()`` it is advantageous
to specify the destination core as a parameter of ``lthread_create()``. See
:ref:`memory_allocation_and_NUMA_awareness` for details.


673**Note 4**:
674
675An L-thread can only detach itself, and cannot detach other L-threads.
676
677
678**Note 5**:
679
680A wait operation on a pthread condition variable is always associated with and
681protected by a mutex which must be owned by the thread at the time it invokes
682``pthread_wait()``. By contrast L-thread condition variables are thread safe
683(for waiters) and do not use an associated mutex. Multiple L-threads (including
684L-threads running on other schedulers) can safely wait on a L-thread condition
685variable. As a consequence the performance of an L-thread condition variables
686is typically an order of magnitude faster than its pthread counterpart.
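The mutex coupling described in this note can be seen in a minimal,
self-contained pthread example (plain POSIX code, no L-thread APIs); it
sketches the wait pattern that an ``lthread_cond_wait()`` caller no longer
needs a mutex for:

```c
#include <pthread.h>
#include <stdbool.h>

/* Classic pthread pattern: the waiter must hold the mutex when calling
 * pthread_cond_wait(), which atomically releases it while sleeping.
 * L-thread condition variables drop this mutex requirement for waiters. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

static void *signaller(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    ready = true;                      /* update the predicate under the mutex */
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int wait_for_ready(void)
{
    pthread_t t;
    pthread_create(&t, NULL, signaller, NULL);
    pthread_mutex_lock(&lock);
    while (!ready)                     /* guard against spurious wakeups */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return ready ? 1 : 0;
}
```

The predicate re-check in a loop is what makes the mutex mandatory in the
pthread model; the thread-safe L-thread condition variable performs the
equivalent bookkeeping internally.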


**Note 6**:

Recursive locking is not supported with L-threads; attempts to take a lock
recursively will be detected and rejected.


**Note 7**:

``lthread_yield()`` will save the current context, insert the current thread
at the back of the ready queue, and resume the next ready thread. Yielding
increases ready queue backlog, see :ref:`ready_queue_backlog` for more details
about the implications of this.


N.B. The context switch time as measured from immediately before the call to
``lthread_yield()`` to the point at which the next ready thread is resumed,
can be an order of magnitude faster than the same measurement for
``pthread_yield()``.


**Note 8**:

``lthread_set_affinity()`` is similar to a yield apart from the fact that the
yielding thread is inserted into a peer ready queue of another scheduler.
The peer ready queue is actually a separate thread safe queue, which means that
threads appearing in the peer ready queue can jump any backlog in the local
ready queue on the destination scheduler.

The context switch time as measured from the time just before the call to
``lthread_set_affinity()`` to just after the same thread is resumed on the new
scheduler can be orders of magnitude faster than the same measurement for
``pthread_setaffinity_np()``.


**Note 9**:

Although there is no ``pthread_sleep()`` function, ``lthread_sleep()`` and
``lthread_sleep_clks()`` can be used wherever ``sleep()``, ``usleep()`` or
``nanosleep()`` might ordinarily be used. The L-thread sleep functions suspend
the current thread, start an ``rte_timer`` and resume the thread when the
timer matures. The ``rte_timer_manage()`` entry point is called on every pass
of the scheduler loop. This means that the worst case jitter on timer expiry
is determined by the longest period between context switches of any running
L-threads.

In a synthetic test with many threads sleeping and resuming, the measured
jitter is typically orders of magnitude lower than the same measurement made
for ``nanosleep()``.


**Note 10**:

Spin locks are not provided because they are problematic in a cooperative
environment; see :ref:`porting_locks_and_spinlocks` for a more detailed
discussion on how to avoid spin locks.


.. _Thread_local_storage_performance:

Thread local storage
^^^^^^^^^^^^^^^^^^^^

Of the three L-thread local storage options the simplest and most efficient is
storing a single application data pointer in the L-thread struct.

The ``PER_LTHREAD`` macros involve a run time computation to obtain the
address of the variable being saved/retrieved, and the accesses must be
de-referenced via a pointer. This means that code using the ``RTE_PER_LCORE``
macros may need some slight adjustment when being ported to L-threads (see
:ref:`porting_thread_local_storage` for hints about porting code that makes
use of thread local storage).

The get/set specific APIs are consistent with their pthread counterparts both
in use and in performance.


.. _memory_allocation_and_NUMA_awareness:

Memory allocation and NUMA awareness
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All memory allocation is from DPDK huge pages, and is NUMA aware. Each
scheduler maintains its own caches of objects: lthreads, their stacks, TLS,
mutexes and condition variables. These caches are implemented as unbounded
lock free MPSC queues. When objects are created they are always allocated
from the caches on the local core (current EAL thread).

If an L-thread has been affinitized to a different scheduler, then it can
always safely free resources to the caches from which they originated
(because the caches are MPSC queues).

If the L-thread has been affinitized to a different NUMA node then the memory
resources associated with it may incur longer access latency.

The commonly used pattern of setting affinity on entry to a thread after it
has started means that memory allocation for both the stack and TLS will have
been made from caches on the NUMA node on which the thread's creator is
running. This has the side effect that access latency will be sub-optimal
after affinitizing.

This side effect can be mitigated to some extent (although not completely) by
specifying the destination CPU as a parameter of ``lthread_create()``. This
causes the L-thread's stack and TLS to be allocated when it is first scheduled
on the destination scheduler; if the destination is on another NUMA node it
results in a more optimal memory allocation.

Note that the lthread struct itself remains allocated from memory on the
creating node. This is unavoidable because an L-thread is known everywhere by
the address of this struct.


.. _object_cache_sizing:

Object cache sizing
^^^^^^^^^^^^^^^^^^^

The per lcore object caches pre-allocate objects in bulk whenever a request to
allocate an object finds a cache empty. By default 100 objects are
pre-allocated; this is defined by ``LTHREAD_PREALLOC`` in the public API
header file ``lthread_api.h``. This means that the caches constantly grow to
meet system demand.

In the present implementation there is no mechanism to reduce the cache sizes
if system demand reduces. Thus the caches will remain at their maximum extent
indefinitely.

A consequence of the bulk pre-allocation of objects is that every 100 (default
value) additional new object create operations results in a call to
``rte_malloc()``. For creation of objects such as L-threads, which trigger the
allocation of even more objects (i.e. their stacks and TLS), this can cause
outliers in scheduling performance.

If this is a problem the simplest mitigation strategy is to dimension the
system by setting the bulk object pre-allocation size to some large number
that you do not expect to be exceeded. This means the caches will be populated
once only, the very first time a thread is created.
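For example, the dimensioning described above amounts to raising the
pre-allocation constant in ``lthread_api.h`` (the value below is purely
illustrative, not a recommendation):

```c
/* lthread_api.h: bulk pre-allocation size for the per-lcore object
 * caches.  Setting it above the expected peak object count means each
 * cache is filled once, at the first thread create operation.
 * 10000 is an arbitrary illustrative value. */
#define LTHREAD_PREALLOC 10000
```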


.. _Ready_queue_backlog:

Ready queue backlog
^^^^^^^^^^^^^^^^^^^

One of the more subtle performance considerations is managing the ready queue
backlog. The fewer threads that are waiting in the ready queue, the faster
any particular thread will get serviced.

In a naive L-thread application with N L-threads simply looping and yielding,
this backlog will always be equal to the number of L-threads; thus the cost of
a yield to a particular L-thread will be N times the context switch time.

This side effect can be mitigated by arranging for threads to be suspended and
wait to be resumed, rather than polling for work by constantly yielding.
Blocking on a mutex or condition variable, or even more obviously having a
thread sleep if it has a low frequency workload, are all mechanisms by which a
thread can be excluded from the ready queue until it really does need to be
run. This can have a significant positive impact on performance.


.. _Initialization_and_shutdown_dependencies:

Initialization, shutdown and dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The L-thread subsystem depends on DPDK for huge page allocation and on the
``rte_timer`` subsystem. The DPDK EAL initialization and
``rte_timer_subsystem_init()`` **MUST** be completed before the L-thread
subsystem can be used.

Thereafter initialization of the L-thread subsystem is largely transparent to
the application. Constructor functions ensure that global variables are
properly initialized. Other than global variables, each scheduler is
initialized independently the first time that an L-thread is created by a
particular EAL thread.

If the schedulers are to be run as isolated and independent schedulers, with
no intention that L-threads running on different schedulers will migrate
between schedulers or synchronize with L-threads running on other schedulers,
then initialization consists simply of creating an L-thread, and then running
the L-thread scheduler.

If there will be interaction between L-threads running on different
schedulers, then it is important that the starting of schedulers on different
EAL threads is synchronized.

To achieve this an additional initialization step is necessary: set the number
of schedulers by calling the API function ``lthread_num_schedulers_set(n)``,
where ``n`` is the number of EAL threads that will run L-thread schedulers.
Setting the number of schedulers to a number greater than 0 will cause all
schedulers to wait until the others have started before beginning to schedule
L-threads.

The L-thread scheduler is started by calling the function ``lthread_run()``.
It should be called from the EAL thread and thus becomes the main loop of the
EAL thread.

The function ``lthread_run()`` will not return until all threads running on
the scheduler have exited, and the scheduler has been explicitly stopped by
calling ``lthread_scheduler_shutdown(lcore)`` or
``lthread_scheduler_shutdown_all()``.

All these functions do is tell the scheduler that it can exit when there are
no longer any running L-threads; neither function forces any running L-thread
to terminate. Any desired application shutdown behavior must be designed and
built into the application to ensure that L-threads complete in a timely
manner.

**Important Note:** It is assumed when the scheduler exits that the
application is terminating for good. The scheduler does not free resources
before exiting, and running the scheduler a subsequent time will result in
undefined behavior.


.. _porting_legacy_code_to_run_on_lthreads:

Porting legacy code to run on L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Legacy code originally written for a pthread environment may be ported to
L-threads if the considerations about differences in scheduling policy and
the constraints discussed in the previous sections can be accommodated.

This section looks in more detail at some of the issues that may have to be
resolved when porting code.


.. _pthread_API_compatibility:

pthread API compatibility
^^^^^^^^^^^^^^^^^^^^^^^^^

The first step is to establish exactly which pthread APIs the legacy
application uses, and to understand the requirements of those APIs. If there
are corresponding L-thread APIs, and the application uses only the default
pthread functionality, then, notwithstanding the other issues discussed here,
it should be feasible to run the application with L-threads. If the legacy
code modifies the default behavior using attributes then it may be necessary
to make some adjustments to eliminate those requirements.


.. _blocking_system_calls:

Blocking system API calls
^^^^^^^^^^^^^^^^^^^^^^^^^

It is important to understand what other system services the application may
be using, bearing in mind that in a cooperatively scheduled environment a
thread cannot block without stalling the scheduler and with it all other
cooperative threads. Any kind of blocking system call, for example file or
socket IO, is a potential problem; a good tool to analyze the application for
this purpose is the ``strace`` utility.

There are many strategies to resolve these kinds of issues, each with its own
merits. Possible solutions include:

* Adopting a polled mode of the system API concerned (if available).

* Arranging for another core to perform the function and synchronizing with
  that core via constructs that will not block the L-thread.

* Affinitizing the thread to another scheduler devoted (as a matter of policy)
  to handling threads wishing to make blocking calls, and then back again when
  finished.
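As an illustration of the first strategy, the sketch below (plain POSIX code,
with a hypothetical helper name) switches a descriptor to non-blocking mode so
a polling L-thread could yield and retry instead of blocking the scheduler
inside ``read()``:

```c
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Polled-mode IO sketch: with O_NONBLOCK set, read() returns -1/EAGAIN
 * instead of blocking, so a cooperative thread can yield and try again
 * later rather than stalling every other L-thread on its scheduler. */
int poll_read_would_block(void)
{
    int fds[2];
    char buf[1];

    if (pipe(fds) != 0)
        return -1;
    fcntl(fds[0], F_SETFL, O_NONBLOCK);     /* switch to polled mode */

    ssize_t n = read(fds[0], buf, 1);       /* pipe is empty: no data yet */
    int would_block = (n == -1 && errno == EAGAIN);

    close(fds[0]);
    close(fds[1]);
    return would_block;     /* 1 => caller should yield and retry later */
}
```

In L-thread code the retry loop would call ``lthread_yield()`` (or
``lthread_sleep()``) between attempts so other ready threads keep running.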


.. _porting_locks_and_spinlocks:

Locks and spinlocks
^^^^^^^^^^^^^^^^^^^

Locks and spinlocks are another source of blocking behavior that, for the same
reasons as system calls, will need to be addressed.

If the application design ensures that the contending L-threads will always
run on the same scheduler, then it is probably safe to remove locks and spin
locks completely.

The only exception to the above rule is if for some reason the code performs
any kind of context switch whilst holding the lock (e.g. yield, sleep, or
block on a different lock, or on a condition variable). This will need to be
determined before deciding to eliminate a lock.

If a lock cannot be eliminated then an L-thread mutex can be substituted for
either kind of lock.

An L-thread blocking on an L-thread mutex will be suspended and will cause
another ready L-thread to be resumed, thus not blocking the scheduler. When
default behavior is required, it can be used as a direct replacement for a
pthread mutex lock.

Spin locks are typically used when lock contention is likely to be rare and
where the period during which the lock may be held is relatively short.
When the contending L-threads are running on the same scheduler, an L-thread
blocking on a spin lock will enter an infinite loop, stopping the scheduler
completely (see :ref:`porting_infinite_loops` below).

If the application design ensures that contending L-threads will always run
on different schedulers, then it might be reasonable to leave a short spin
lock that rarely experiences contention in place.

If after all considerations it appears that a spin lock can neither be
eliminated completely, replaced with an L-thread mutex, nor left in place as
is, then an alternative is to loop on a flag, with a call to
``lthread_yield()`` inside the loop (n.b. if the contending L-threads might
ever run on different schedulers the flag will need to be manipulated
atomically).

Spinning and yielding is the least preferred solution since it introduces
ready queue backlog (see also :ref:`ready_queue_backlog`).
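A rough sketch of the loop-on-a-flag fallback, written with plain pthreads so
it is self-contained: ``sched_yield()`` stands in for ``lthread_yield()``, and
a C11 atomic flag covers the cross-scheduler case mentioned above (all names
here are illustrative, not part of the L-thread API):

```c
#include <stdatomic.h>
#include <sched.h>
#include <pthread.h>

/* Yield-in-the-loop lock: test-and-set an atomic flag and yield on
 * failure, instead of spinning hot.  In L-thread code sched_yield()
 * would be lthread_yield(); the atomic flag is only required if the
 * contending L-threads may run on different schedulers. */
static atomic_flag busy = ATOMIC_FLAG_INIT;
static int counter = 0;                 /* protected by the flag "lock" */

static void flag_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&busy, memory_order_acquire))
        sched_yield();                  /* lthread_yield() in ported code */
}

static void flag_unlock(void)
{
    atomic_flag_clear_explicit(&busy, memory_order_release);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        flag_lock();
        counter++;                      /* critical section */
        flag_unlock();
    }
    return NULL;
}

int run_contended(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;                     /* 20000 when exclusion holds */
}
```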


.. _porting_sleeps_and_delays:

Sleeps and delays
^^^^^^^^^^^^^^^^^

Yet another kind of blocking behavior (albeit momentary) comes from delay
functions like ``sleep()``, ``usleep()``, ``nanosleep()`` etc. All will have
the consequence of stalling the L-thread scheduler, and unless the delay is
very short (e.g. a very short nanosleep) calls to these functions will need
to be eliminated.

The simplest mitigation strategy is to use the L-thread sleep API functions,
of which two variants exist: ``lthread_sleep()`` and ``lthread_sleep_clks()``.
These functions start an rte_timer against the L-thread, suspend the L-thread
and cause another ready L-thread to be resumed. The suspended L-thread is
resumed when the rte_timer matures.


.. _porting_infinite_loops:

Infinite loops
^^^^^^^^^^^^^^

Some applications have threads with loops that contain no inherent
rescheduling opportunity, and rely solely on the OS time slicing to share the
CPU. In a cooperative environment this will stop everything dead. These kinds
of loops are not hard to identify; in a debug session you will find the
debugger is always stopping in the same loop.

The simplest solution to this kind of problem is to insert an explicit
``lthread_yield()`` or ``lthread_sleep()`` into the loop. Another solution
might be to include the function performed by the loop into the execution
path of some other loop that does in fact yield, if this is possible.


.. _porting_thread_local_storage:

Thread local storage
^^^^^^^^^^^^^^^^^^^^

If the application uses thread local storage, the use case should be studied
carefully.

In a legacy pthread application either or both of the ``__thread`` prefix and
the pthread set/get specific APIs may have been used to define storage local
to a pthread.

In some applications it may be a reasonable assumption that the data could,
or in fact most likely should, be placed in L-thread local storage.

If the application (like many DPDK applications) has assumed a certain
relationship between a pthread and the CPU to which it is affinitized, there
is a risk that thread local storage may have been used to save some data
items that are correctly logically associated with the CPU, and other items
which relate to application context for the thread. Only a good understanding
of the application will reveal such cases.

If the application requires that an L-thread be able to move between
schedulers, then care should be taken to separate these kinds of data into
per lcore and per L-thread storage. In this way a migrating thread will bring
with it the local data it needs, and pick up the new logical core specific
values from pthread local storage at its new home.
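The distinction can be demonstrated with plain pthread TLS. In ported code the
per-L-thread half of the data would instead travel with the L-thread (for
example via its local storage APIs), while ``__thread`` data stays behind with
the hosting EAL pthread, as below:

```c
#include <pthread.h>

/* __thread gives each OS thread (each EAL pthread / lcore in DPDK terms)
 * its own private copy of the variable.  Data that must migrate with an
 * L-thread cannot live here; it belongs in per-L-thread storage. */
static __thread long per_pthread_value;

static void *set_and_read(void *arg)
{
    per_pthread_value = *(long *)arg;       /* writes this thread's copy only */
    return (void *)per_pthread_value;
}

int tls_is_per_thread(void)
{
    long a = 1, b = 2;
    void *ra, *rb;
    pthread_t ta, tb;

    per_pthread_value = 99;                 /* main thread's private copy */
    pthread_create(&ta, NULL, set_and_read, &a);
    pthread_create(&tb, NULL, set_and_read, &b);
    pthread_join(ta, &ra);
    pthread_join(tb, &rb);

    /* each thread saw its own value; main's copy was not disturbed */
    return (long)ra == 1 && (long)rb == 2 && per_pthread_value == 99;
}
```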


.. _pthread_shim:

Pthread shim
~~~~~~~~~~~~

A convenient way to get something working with legacy code can be to use a
shim that adapts pthread API calls to the corresponding L-thread ones. This
approach will not mitigate any of the porting considerations mentioned in the
previous sections, but it will reduce the amount of code churn that would
otherwise be involved. It is a reasonable approach to evaluate L-threads
before investing effort in porting to the native L-thread APIs.


Overview
^^^^^^^^
The L-thread subsystem includes an example pthread shim. This is a partial
implementation but does contain the API stubs needed to get basic
applications running. There is a simple "hello world" application that
demonstrates the use of the pthread shim.

A subtlety of working with a shim is that the application will still need to
make use of the genuine pthread library functions, at the very least in order
to create the EAL threads in which the L-thread schedulers will run. This is
the case with DPDK initialization and exit.

To deal with the initialization and shutdown scenarios, the shim is capable
of switching its adaptor functionality on or off; an application can control
this behavior by calling the function ``pt_override_set()``. The default
state is disabled.

The pthread shim uses the dynamic linker loader and saves the loaded
addresses of the genuine pthread API functions in an internal table. When the
shim functionality is enabled it performs the adaptor function; when disabled
it invokes the genuine pthread function.
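A minimal sketch of that symbol-capture technique, assuming glibc's
``dlsym(RTLD_NEXT, ...)`` behavior; the function and variable names here are
illustrative, not the shim's actual internal API:

```c
#define _GNU_SOURCE             /* exposes RTLD_NEXT on glibc */
#include <dlfcn.h>
#include <pthread.h>

/* Capture the genuine pthread entry point so the shim can forward to it
 * when its adaptor functionality is switched off.  RTLD_NEXT resolves
 * the symbol in the next object after the caller in link order, i.e.
 * past any interposing shim. */
static int (*real_mutex_lock)(pthread_mutex_t *);

int capture_real_symbol(void)
{
    *(void **)&real_mutex_lock = dlsym(RTLD_NEXT, "pthread_mutex_lock");
    return real_mutex_lock != NULL;     /* 1 if the lookup succeeded */
}
```

A real shim would fill a whole table of such pointers once at start-up, then
dispatch through them whenever the override is disabled.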

The function ``pthread_exit()`` has additional special handling. The standard
system header file pthread.h declares ``pthread_exit()`` with
``__attribute__((noreturn))``. This is an optimization that is possible
because the pthread is terminating, and it enables the compiler to omit the
normal handling of the stack and protection of registers since the function
is not expected to return; in fact the thread is being destroyed. These
optimizations are applied in both the callee and the caller of the
``pthread_exit()`` function.

In our cooperative scheduling environment this behavior is inadmissible. The
pthread is the L-thread scheduler thread, and, although an L-thread is
terminating, there must be a return to the scheduler in order that the system
can continue to run. Further, returning from a function with attribute
``noreturn`` is invalid and may result in undefined behavior.

The solution is to redefine the ``pthread_exit`` function with a macro,
causing it to be mapped to a stub function in the shim that does not have the
``noreturn`` attribute. This macro is defined in the file
``pthread_shim.h``. The stub function is otherwise no different than any of
the other stub functions in the shim, and will switch between the real
``pthread_exit()`` function or the ``lthread_exit()`` function as required.
The only difference is that the mapping to the stub is done by macro
substitution.

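The macro-substitution trick can be sketched in isolation as follows; the
names are hypothetical stand-ins for the real definitions in
``pthread_shim.h``, and the stub body merely records the interception instead
of dispatching to ``lthread_exit()``:

```c
#include <string.h>

/* A stub WITHOUT the noreturn attribute: control can come back to the
 * scheduler after an L-thread terminates.  The real shim stub would
 * dispatch to pthread_exit() or lthread_exit() as required. */
static const char *last_call;

static void pthread_exit_shim(void *retval)
{
    (void)retval;
    last_call = "shim";     /* stand-in for the lthread_exit() dispatch */
}

/* The macro remaps the name, defeating pthread.h's noreturn declaration
 * in any translation unit that includes the shim header. */
#define pthread_exit(v) pthread_exit_shim(v)

int shim_intercepts_exit(void)
{
    pthread_exit(NULL);                 /* expands to the shim stub */
    return strcmp(last_call, "shim") == 0;
}
```

Because the remapping happens at preprocessing time, it only affects code
compiled with the shim header included, which is exactly the limitation the
next paragraph describes.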
A consequence of this is that the file ``pthread_shim.h`` must be included in
legacy code wishing to make use of the shim. It also means that dynamic
linkage of a pre-compiled binary that did not include ``pthread_shim.h`` is
not supported.

Given the requirements for porting legacy code outlined in
:ref:`porting_legacy_code_to_run_on_lthreads`, most applications will require
at least some minimal adjustment and recompilation to run on L-threads, so
pre-compiled binaries are unlikely to be encountered in practice.

In summary the shim approach adds some overhead but can be a useful tool to
help establish the feasibility of a code reuse project. It is also a fairly
straightforward task to extend the shim if necessary.

**Note:** Bearing in mind the preceding discussions about the impact of
making blocking calls, switching the shim in and out on the fly to invoke any
pthread API that might block is something that should typically be avoided.


Building and running the pthread shim
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The shim example application is located in the performance-thread folder of
the sample applications.

To build and run the pthread shim example:

#. Go to the example applications folder:

   .. code-block:: console

       export RTE_SDK=/path/to/rte_sdk
       cd ${RTE_SDK}/examples/performance-thread/pthread_shim

#. Set the target (a default target is used if not specified). For example:

   .. code-block:: console

       export RTE_TARGET=x86_64-native-linux-gcc

   See the DPDK Getting Started Guide for possible RTE_TARGET values.

#. Build the application:

   .. code-block:: console

       make

#. To run the pthread_shim example:

   .. code-block:: console

       lthread-pthread-shim -c core_mask -n number_of_channels

.. _lthread_diagnostics:

L-thread Diagnostics
~~~~~~~~~~~~~~~~~~~~

When debugging you must take account of the fact that the L-threads are run
in a single pthread. The current scheduler is defined by
``RTE_PER_LCORE(this_sched)``, and the current lthread is stored at
``RTE_PER_LCORE(this_sched)->current_lthread``. Thus on a breakpoint in a GDB
session the current lthread can be obtained by displaying the pthread local
variable ``per_lcore_this_sched->current_lthread``.

Another useful diagnostic feature is the possibility to trace significant
events in the life of an L-thread. This feature is enabled by changing the
value of ``LTHREAD_DIAG`` from 0 to 1 in the file ``lthread_diag_api.h``.

Tracing of events can be individually masked, and the mask may be programmed
at run time. An unmasked event results in a callback that provides
information about the event. The default callback simply prints trace
information. The default mask is 0 (all events off); the mask can be modified
by calling the function ``lthread_diagnostic_set_mask()``.

It is possible to register a user callback function to implement more
sophisticated diagnostic functions.
Object creation events (lthread, mutex, and condition variable) accept, and
store in the created object, a user supplied reference value returned by the
callback function.

The lthread reference value is passed back in all subsequent event callbacks,
and APIs are provided to retrieve the reference value from mutexes and
condition variables. This enables a user to monitor, count, or filter for
specific events, on specific objects, for example to monitor for a specific
thread signaling a specific condition variable, or to monitor on all timer
events; the possibilities and combinations are endless.

The callback function can be set by calling the function
``lthread_diagnostic_enable()`` supplying a callback function pointer and an
event mask.

Setting ``LTHREAD_DIAG`` also enables counting of statistics about cache and
queue usage, and these statistics can be displayed by calling the function
``lthread_diag_stats_display()``. This function also performs a consistency
check on the caches and queues. The function should only be called from the
master EAL thread after all slave threads have stopped and returned to the C
main program, otherwise the consistency check will fail.