.. _process-architecture:

Process Architecture
====================

FRR is a suite of daemons that serve different functions. This document
describes the internal architecture of the daemons, focusing on their general
design patterns, and especially how threads are used in the daemons that use
them.

Overview
--------
The fundamental pattern used in FRR daemons is an `event loop
<https://en.wikipedia.org/wiki/Event_loop>`_. Some daemons use `kernel threads
<https://en.wikipedia.org/wiki/Thread_(computing)#Kernel_threads>`_. In these
daemons, each kernel thread runs its own event loop. The event loop
implementation is constructed to be thread safe and to allow threads other than
its owning thread to schedule events on it. The rest of this document describes
these two designs in detail.

Terminology
-----------
Because this document describes the architecture for kernel threads as well as
the event system, a digression on terminology is in order here.

Historically Quagga's loop system was viewed as an implementation of userspace
threading. Because of this design choice, the names for various datastructures
within the event system are variations on the term "thread". The primary
datastructure that holds the state of an event loop in this system is called a
"threadmaster". Events scheduled on the event loop - what would today be called
an 'event' or 'task' in systems such as libevent - are called "threads" and the
datastructure for them is ``struct event``. To add to the confusion, these
"threads" have various types, one of which is "event". To hopefully avoid some
of this confusion, this document refers to these "threads" as 'tasks' except
where the datastructures are explicitly named. When they are explicitly named,
they will be formatted ``like this`` to differentiate them from the conceptual
names. When speaking of kernel threads, the term used will be "pthread", since
FRR's kernel threading implementation uses the POSIX threads API.

.. This should be broken into its own document under :ref:`libfrr`
.. _event-architecture:

Event Architecture
------------------
This section presents a brief overview of the event model as currently
implemented in FRR. This doc should be expanded and broken off into its own
section. For now it provides basic information necessary to understand the
interplay between the event system and kernel threads.

The core event system is implemented in :file:`lib/thread.[ch]`. The primary
structure is ``struct thread_master``, hereafter referred to as a
``threadmaster``. A ``threadmaster`` is a global state object, or context, that
holds all the tasks currently pending execution as well as statistics on tasks
that have already executed. The event system is driven by adding tasks to this
data structure and then calling a function to retrieve the next task to
execute. At initialization, a daemon will typically create one
``threadmaster``, add a small set of initial tasks, and then run a loop to
fetch each task and execute it.

These tasks have various types corresponding to their general action. The types
are given by integer macros in :file:`event.h` and are:

``THREAD_READ``
   Task which waits for a file descriptor to become ready for reading and then
   executes.

``THREAD_WRITE``
   Task which waits for a file descriptor to become ready for writing and then
   executes.

``THREAD_TIMER``
   Task which executes after a certain amount of time has passed since it was
   scheduled.

``THREAD_EVENT``
   Generic task that executes with high priority and carries an arbitrary
   integer indicating the event type to its handler. These are commonly used to
   implement the finite state machines typically found in routing protocols.

``THREAD_READY``
   Type used internally for tasks on the ready queue.

``THREAD_UNUSED``
   Type used internally for ``struct event`` objects that aren't being used.
   The event system pools ``struct event`` to avoid heap allocations; this is
   the type they have when they're in the pool.

``THREAD_EXECUTE``
   Just before a task is run, its type is changed to this. This is used to show
   ``X`` as the type in the output of :clicmd:`show thread cpu`.

The programmer never has to work with these types explicitly. Each type of task
is created and queued via special-purpose functions (actually macros, but
irrelevant for the time being) for the specific type. For example, to add a
``THREAD_READ`` task, you would call

::

   event_add_read(struct thread_master *master,
                  void (*handler)(struct event *), void *arg, int fd,
                  struct event **ref);

The ``struct event`` is then created and added to the appropriate internal
datastructure within the ``threadmaster``. Note that the ``READ`` and
``WRITE`` tasks are independent - a ``READ`` task only tests for
readability, for example.
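
Since the scheduled function receives the ``struct event`` itself, per-task
context is typically recovered inside the handler. A minimal sketch of a read
handler, assuming the ``EVENT_ARG``/``EVENT_FD`` accessor macros and a
``master`` pointer to the owning ``threadmaster`` (``struct my_ctx`` is
hypothetical):

::

   static void my_read_handler(struct event *task)
   {
           struct my_ctx *ctx = EVENT_ARG(task); /* arg given at schedule time */
           int fd = EVENT_FD(task);              /* fd the task is keyed on */

           /* ... read from fd ... */

           /* I/O tasks fire once; reschedule to keep listening */
           event_add_read(master, my_read_handler, ctx, fd, NULL);
   }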

The Event Loop
^^^^^^^^^^^^^^
To use the event system, after creating a ``threadmaster`` the program adds an
initial set of tasks. As these tasks execute, they add more tasks that execute
at some point in the future. This sequence of tasks drives the lifecycle of the
program. When no more tasks are available, the program dies. Typically at
startup the first task added is an I/O task for VTYSH as well as any network
sockets needed for peerings or IPC.

To retrieve the next task to run, the program calls ``thread_fetch()``.
``thread_fetch()`` internally computes which task to execute next based on
rudimentary priority logic. Events (type ``THREAD_EVENT``) execute with the
highest priority, followed by expired timers and finally I/O tasks (type
``THREAD_READ`` and ``THREAD_WRITE``). When scheduling a task, a function and
an arbitrary argument are provided. The task returned from ``thread_fetch()``
is then executed with ``thread_call()``.
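
Putting the pieces together, a daemon's main loop is conceptually just this
fetch-execute cycle. A rough sketch using the function names described above
(exact names and signatures vary between FRR versions):

::

   struct thread_master *master = thread_master_create(NULL);
   struct event task;

   /* ... schedule initial tasks: VTYSH I/O, protocol sockets, etc. ... */

   while (thread_fetch(master, &task))
           thread_call(&task);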

The following diagram illustrates a simplified version of this infrastructure.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-single.png
   :align: center

   Lifecycle of a program using a single threadmaster.

The series of "task" boxes represents the current ready task queue. The various
other queues for other types are not shown. The fetch-execute loop is
illustrated at the bottom.

Mapping the general names used in the figure to specific FRR functions:

- ``task`` is ``struct event *``
- ``fetch`` is ``thread_fetch()``
- ``exec()`` is ``thread_call()``
- ``cancel()`` is ``event_cancel()``
- ``schedule()`` is any of the various task-specific ``event_add_*`` functions

Adding tasks is done with various task-specific function-like macros. These
macros wrap underlying functions in :file:`thread.c` to provide additional
information added at compile time, such as the line number the task was
scheduled from, that can be accessed at runtime for debugging, logging and
informational purposes. Each task type has its own specific scheduling function
that follows the naming convention ``event_add_<type>``; see :file:`event.h`
for details.
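
For illustration, the scheduling calls for the four externally used task types
look roughly like this (``master``, ``handler``, ``arg``, ``fd``, ``val`` and
``ref`` are placeholders):

::

   event_add_read(master, handler, arg, fd, &ref);   /* THREAD_READ       */
   event_add_write(master, handler, arg, fd, &ref);  /* THREAD_WRITE      */
   event_add_timer(master, handler, arg, 5, &ref);   /* THREAD_TIMER, 5s  */
   event_add_event(master, handler, arg, val, &ref); /* THREAD_EVENT      */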

There are some gotchas to keep in mind:

- I/O tasks are keyed off the file descriptor associated with the I/O
  operation. This means that for any given file descriptor, only one of each
  type of I/O task (``THREAD_READ`` and ``THREAD_WRITE``) can be scheduled. For
  example, scheduling two write tasks one after the other will overwrite the
  first task with the second, resulting in total loss of the first task and
  difficult bugs (see the sketch after this list).

- Timer tasks are only as accurate as the monotonic clock provided by the
  underlying operating system.

- Memory management of the arbitrary handler argument passed in the schedule
  call is the responsibility of the caller.
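
A minimal sketch of the first gotcha, with hypothetical handlers:

::

   /* Both calls key a THREAD_WRITE task on sock_fd. The second call
    * overwrites the first, so handler_a is silently lost. */
   event_add_write(master, handler_a, NULL, sock_fd, NULL);
   event_add_write(master, handler_b, NULL, sock_fd, NULL);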

Kernel Thread Architecture
--------------------------
Efforts have begun to introduce kernel threads into FRR to improve performance
and stability. Naturally a kernel thread architecture has long been seen as
orthogonal to an event-driven architecture, and the two do have significant
overlap in terms of design choices. Since the event model is tightly integrated
into FRR, careful thought has been put into how pthreads are introduced, what
role they fill, and how they will interoperate with the event model.

Design Overview
^^^^^^^^^^^^^^^
Each kernel thread behaves as a lightweight process within FRR, sharing the
same process memory space. On the other hand, the event system is designed to
run in a single process and drive serial execution of a set of tasks. With this
consideration, a natural choice is to implement the event system within each
kernel thread. This allows us to leverage the event-driven execution model with
the currently existing task and context primitives. In this way the familiar
execution model of FRR gains the ability to execute tasks simultaneously while
preserving the existing model for concurrency.

The following figure illustrates the architecture with multiple pthreads, each
running their own ``threadmaster``-based event loop.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-multiple.png
   :align: center

   Lifecycle of a program using multiple pthreads, each running their own
   ``threadmaster``

Each roundrect represents a single pthread running the same event loop
described under :ref:`event-architecture`. Note the arrow from the ``exec()``
box on the right to the ``schedule()`` box in the middle pthread. This
illustrates code running in one pthread scheduling a task onto another
pthread's threadmaster. A global lock for each ``threadmaster`` is used to
synchronize these operations. The pthread names are examples.

.. This should be broken into its own document under :ref:`libfrr`
.. _kernel-thread-wrapper:

Kernel Thread Wrapper
^^^^^^^^^^^^^^^^^^^^^
The basis for the integration of pthreads and the event system is a lightweight
wrapper for both systems implemented in :file:`lib/frr_pthread.[ch]`. The
header provides a core datastructure, ``struct frr_pthread``, that encapsulates
structures from both POSIX threads and :file:`thread.[ch]`. In particular, this
datastructure has a pointer to a ``threadmaster`` that runs within the pthread.
It also has fields for a name as well as start and stop functions that have
signatures similar to the POSIX arguments for ``pthread_create()``.

Calling ``frr_pthread_new()`` creates and registers a new ``frr_pthread``. The
returned structure has a pre-initialized ``threadmaster``, and its ``start``
and ``stop`` functions are initialized to defaults that will run a basic event
loop with the given threadmaster. Calling ``frr_pthread_run()`` starts the
thread with the ``start`` function. From there, the model is the same as the
regular event model. To schedule tasks on a particular pthread, simply use the
regular :file:`thread.c` functions as usual and provide the ``threadmaster``
pointed to from the ``frr_pthread``. As part of implementing the wrapper, the
:file:`thread.c` functions were made thread-safe. Consequently, it is safe to
schedule events on a ``threadmaster`` belonging to the calling pthread as well
as to *any other pthread*. This serves as the basis for inter-thread
communication and boils down to a slightly more complicated method of message
passing, where the messages are the regular task events as used in the
event-driven model. The only difference is task cancellation, which requires
calling ``event_cancel_async()`` instead of ``event_cancel()`` to cancel a task
currently scheduled on a ``threadmaster`` belonging to a different pthread.
This is necessary to avoid race conditions in the specific case where one
pthread wants to guarantee that a task on another pthread is cancelled before
proceeding.
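
As a hedged sketch of this lifecycle (the pthread name and the scheduled
function are hypothetical, and exact signatures may differ between FRR
versions), creating a pthread that runs the default event loop and scheduling
a task onto it from another pthread looks roughly like this:

::

   #include "frr_pthread.h"

   static struct frr_pthread *io_pthread;

   static void my_task(struct event *task)
   {
           /* executes inside io_pthread's event loop */
   }

   void start_io_pthread(void)
   {
           /* NULL attributes select the default start/stop functions,
            * which run a basic event loop over the pthread's threadmaster */
           io_pthread = frr_pthread_new(NULL, "My I/O thread", "myio");
           frr_pthread_run(io_pthread, NULL);
           frr_pthread_wait_running(io_pthread);

           /* inter-thread communication: schedule a task onto the new
            * pthread's threadmaster; use event_cancel_async() if the task
            * must later be cancelled from a different pthread */
           event_add_event(io_pthread->master, my_task, NULL, 0, NULL);
   }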

In addition, the existing commands to show statistics and other information for
tasks within the event-driven model have been expanded to handle multiple
pthreads; running :clicmd:`show thread cpu` will display the usual event
breakdown, but it will do so for each pthread running in the program. For
example, :ref:`bgpd` runs a dedicated I/O pthread and shows the following
output for :clicmd:`show thread cpu`:

::

   frr# show thread cpu

   Thread statistics for bgpd:

   Showing statistics for pthread main
   ------------------------------------
                                  CPU (user+system):   Real (wall-clock):
   Active  Runtime(ms)  Invoked  Avg uSec  Max uSecs  Avg uSec  Max uSecs  Type  Thread
        0     1389.000       10    138900     248000    135549     255349  T   subgroup_coalesce_timer
        0        0.000        1         0          0        18         18  T   bgp_startup_timer_expire
        0      850.000       18     47222     222000     47795     233814  T   work_queue_run
        0        0.000       10         0          0         6         14  T   update_subgroup_merge_check_thread_cb
        0        0.000        8         0          0       117        160  W   zclient_flush_data
        2        2.000        1      2000       2000       831        831  R   bgp_accept
        0        1.000        1      1000       1000      2832       2832  E   zclient_connect
        1    42082.000   240574       174      37000       178      72810  R   vtysh_read
        1      152.000     1885        80       2000        96       6292  R   zclient_read
        0   549346.000  2997298       183       7000       153      20242  E   bgp_event
        0     2120.000      300      7066      14000      6813      22046  T   (bgp_holdtime_timer)
        0        0.000        2         0          0        57         59  T   update_group_refresh_default_originate_route_map
        0       90.000        1     90000      90000     73729      73729  T   bgp_route_map_update_timer
        0     1417.000     9147       154      48000       132      61998  T   bgp_process_packet
      300    71807.000  2995200        23       3000        24      11066  T   (bgp_connect_timer)
        0     1894.000    12713       148      45000       112      33606  T   (bgp_generate_updgrp_packets)
        0        0.000        1         0          0       105        105  W   vtysh_write
        0       52.000      599        86       2000       138       6992  T   (bgp_start_timer)
        1        1.000        8       125       1000       164        593  R   vtysh_accept
        0       15.000      600        25       2000        15        153  T   (bgp_routeadv_timer)
        0       11.000      299        36       3000        53       3128  RW  bgp_connect_check


   Showing statistics for pthread BGP I/O thread
   ----------------------------------------------
                                  CPU (user+system):   Real (wall-clock):
   Active  Runtime(ms)  Invoked  Avg uSec  Max uSecs  Avg uSec  Max uSecs  Type  Thread
        0     1611.000     9296       173      13000       188      13685  R   bgp_process_reads
        0     2995.000    11753       254      26000       182      29355  W   bgp_process_writes


   Showing statistics for pthread BGP Keepalives thread
   -----------------------------------------------------
                                  CPU (user+system):   Real (wall-clock):
   Active  Runtime(ms)  Invoked  Avg uSec  Max uSecs  Avg uSec  Max uSecs  Type  Thread
   No data to display yet.

Attentive readers will notice that there is a third pthread, the Keepalives
pthread. This pthread is responsible for -- surprise -- generating keepalives
for peers. However, there are no statistics showing for that pthread. Although
the pthread uses the ``frr_pthread`` wrapper, it opts not to use the embedded
``threadmaster`` facilities. Instead it replaces the ``start`` and ``stop``
functions with custom functions. This was done because the ``threadmaster``
facilities introduce a small but significant amount of overhead relative to the
pthread's task. In this case, since the pthread does not need the event-driven
model and does not need to receive tasks from other pthreads, it is simpler and
more efficient to implement it outside of the provided event facilities. The
point to take away from this example is that while the facilities to make using
pthreads within FRR easy are already implemented, the wrapper is flexible and
allows usage of other models while still integrating with the rest of the FRR
core infrastructure. Starting and stopping this pthread works the same as it
does for any other ``frr_pthread``; the only difference is that event
statistics are not collected for it, because there are no events.
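
A hedged sketch of this pattern (the handlers and the ``running`` flag are
hypothetical; the real implementation lives in :file:`bgpd/bgp_keepalives.c`):

::

   #include <stdatomic.h>
   #include "frr_pthread.h"

   static atomic_bool running = true;

   /* custom start function: a plain loop, no threadmaster involved */
   static void *keepalive_start(void *arg)
   {
           struct frr_pthread *fpt = arg;

           frr_pthread_set_name(fpt);
           while (atomic_load(&running))
                   send_keepalives(); /* hypothetical periodic work */
           return NULL;
   }

   /* custom stop function: signal the loop, then join the pthread */
   static int keepalive_stop(struct frr_pthread *fpt, void **result)
   {
           atomic_store(&running, false);
           pthread_join(fpt->thread, result);
           return 0;
   }

   static const struct frr_pthread_attr attr = {
           .start = keepalive_start,
           .stop = keepalive_stop,
   };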

Notes on Design and Documentation
---------------------------------
Because of the choice to embed the existing event system into each pthread
within FRR, at this time there is no integrated support for other models of
pthread use such as divide and conquer. Similarly, there is no explicit support
for thread pooling or similar higher level constructs. The currently existing
infrastructure is designed around the concept of long-running worker threads
responsible for specific jobs within each daemon. This is not to say that
divide and conquer, thread pooling, etc. could not be implemented in the
future. However, designs in this direction must be very careful to take into
account the existing codebase. Introducing kernel threads into programs that
have been written under the assumption of a single thread of execution must be
done very carefully to avoid insidious errors and to ensure the program remains
understandable and maintainable.

In keeping with these goals, future work on kernel threading should be
extensively documented here, and FRR developers should be very careful with
their design choices, as poor choices, once tightly integrated, can prove
catastrophic for development efforts in the future.