.. _process-architecture:

Process Architecture
====================

FRR is a suite of daemons that serve different functions. This document
describes the internal architecture of the daemons, focusing on their general
design patterns, and especially how threads are used in the daemons that use
them.

Overview
--------
The fundamental pattern used in FRR daemons is an `event loop
<https://en.wikipedia.org/wiki/Event_loop>`_. Some daemons use `kernel threads
<https://en.wikipedia.org/wiki/Thread_(computing)#Kernel_threads>`_. In these
daemons, each kernel thread runs its own event loop. The event loop
implementation is constructed to be thread safe and to allow threads other than
its owning thread to schedule events on it. The rest of this document describes
these two designs in detail.

Terminology
-----------
Because this document describes the architecture for kernel threads as well as
the event system, a digression on terminology is in order here.

Historically Quagga's loop system was viewed as an implementation of userspace
threading. Because of this design choice, the names for various datastructures
within the event system are variations on the term "thread". The primary
datastructure that holds the state of an event loop in this system is called a
"threadmaster". Events scheduled on the event loop - what would today be called
an 'event' or 'task' in systems such as libevent - are called "threads" and the
datastructure for them is ``struct event``. To add to the confusion, these
"threads" have various types, one of which is "event". To hopefully avoid some
of this confusion, this document refers to these "threads" as a 'task' except
where the datastructures are explicitly named. When they are explicitly named,
they will be formatted ``like this`` to differentiate from the conceptual
names. When speaking of kernel threads, the term used will be "pthread" since
FRR's kernel threading implementation uses the POSIX threads API.

.. This should be broken into its own document under :ref:`libfrr`
.. _event-architecture:

Event Architecture
------------------
This section presents a brief overview of the event model as currently
implemented in FRR. This doc should be expanded and broken off into its own
section. For now it provides basic information necessary to understand the
interplay between the event system and kernel threads.

The core event system is implemented in :file:`lib/event.[ch]`. The primary
structure is ``struct thread_master``, hereafter referred to as a
``threadmaster``. A ``threadmaster`` is a global state object, or context, that
holds all the tasks currently pending execution as well as statistics on tasks
that have already executed. The event system is driven by adding tasks to this
data structure and then calling a function to retrieve the next task to
execute. At initialization, a daemon will typically create one
``threadmaster``, add a small set of initial tasks, and then run a loop to
fetch each task and execute it.

These tasks have various types corresponding to their general action. The types
are given by integer macros in :file:`event.h` and are:

``THREAD_READ``
   Task which waits for a file descriptor to become ready for reading and then
   executes.

``THREAD_WRITE``
   Task which waits for a file descriptor to become ready for writing and then
   executes.

``THREAD_TIMER``
   Task which executes after a certain amount of time has passed since it was
   scheduled.

``THREAD_EVENT``
   Generic task that executes with high priority and carries an arbitrary
   integer indicating the event type to its handler. These are commonly used to
   implement the finite state machines typically found in routing protocols.

``THREAD_READY``
   Type used internally for tasks on the ready queue.

``THREAD_UNUSED``
   Type used internally for ``struct event`` objects that aren't being used.
   The event system pools ``struct event`` to avoid heap allocations; this is
   the type they have when they're in the pool.

``THREAD_EXECUTE``
   Just before a task is run its type is changed to this. This is used to show
   ``X`` as the type in the output of :clicmd:`show thread cpu`.

The programmer never has to work with these types explicitly. Each type of task
is created and queued via special-purpose functions (actually macros, but
irrelevant for the time being) for the specific type. For example, to add a
``THREAD_READ`` task, you would call

::

   event_add_read(struct thread_master *master, void (*handler)(struct event *), void *arg, int fd, struct event **ref);

The ``struct event`` is then created and added to the appropriate internal
datastructure within the ``threadmaster``. Note that the ``READ`` and
``WRITE`` tasks are independent - a ``READ`` task only tests for
readability, for example.
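
For illustration, here is a minimal sketch of a read handler and how it might
be scheduled. ``struct conn``, ``conn_read()`` and the global ``master`` are
invented names, not FRR source; only ``event_add_read()`` and ``struct event``
come from the API described above. Because a task fires exactly once, the
handler re-arms itself to keep watching the descriptor:

::

   #include <unistd.h>  /* read() */
   #include "event.h"   /* struct event, event_add_read() (per the file names above) */

   extern struct thread_master *master; /* the daemon's threadmaster (assumed) */

   struct conn {
           int fd;
           struct event *t_read; /* handle for rescheduling/cancellation */
   };

   static void conn_read(struct event *thread)
   {
           struct conn *c = thread->arg; /* the `arg` given at schedule time */
           char buf[4096];
           ssize_t n = read(c->fd, buf, sizeof(buf));

           if (n <= 0)
                   return; /* error or EOF: do not re-arm */

           /* ... process buf ... */

           /* Re-arm: a task fires once, so schedule a fresh READ task. */
           event_add_read(master, conn_read, c, c->fd, &c->t_read);
   }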

The Event Loop
^^^^^^^^^^^^^^
To use the event system, after creating a ``threadmaster`` the program adds an
initial set of tasks. As these tasks execute, they add more tasks that execute
at some point in the future. This sequence of tasks drives the lifecycle of the
program. When no more tasks are available, the program dies. Typically at
startup the first task added is an I/O task for VTYSH as well as any network
sockets needed for peerings or IPC.

To retrieve the next task to run the program calls ``thread_fetch()``.
``thread_fetch()`` internally computes which task to execute next based on
rudimentary priority logic. Events (type ``THREAD_EVENT``) execute with the
highest priority, followed by expired timers and finally I/O tasks (type
``THREAD_READ`` and ``THREAD_WRITE``). When scheduling a task a function and an
arbitrary argument are provided. The task returned from ``thread_fetch()`` is
then executed with ``thread_call()``.
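
Putting these pieces together, the core of a daemon's main loop reduces to the
following simplified sketch (in practice this loop lives inside FRR's library
startup helpers rather than being written by hand):

::

   /* Fetch-execute loop: thread_fetch() blocks until a task is ready and
    * returns NULL at shutdown, so the loop runs until no tasks remain. */
   struct event task;

   while (thread_fetch(master, &task))
           thread_call(&task); /* runs the handler with its scheduled argument */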

The following diagram illustrates a simplified version of this infrastructure.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-single.png
   :align: center

   Lifecycle of a program using a single threadmaster.

The series of "task" boxes represents the current ready task queue. The various
other queues for other types are not shown. The fetch-execute loop is
illustrated at the bottom.

Mapping the general names used in the figure to specific FRR functions:

- ``task`` is ``struct event *``
- ``fetch`` is ``thread_fetch()``
- ``exec()`` is ``thread_call()``
- ``cancel()`` is ``event_cancel()``
- ``schedule()`` is any of the various task-specific ``event_add_*`` functions

Adding tasks is done with various task-specific function-like macros. These
macros wrap underlying functions in :file:`event.c` to provide additional
information added at compile time, such as the line number the task was
scheduled from, that can be accessed at runtime for debugging, logging and
informational purposes. Each task type has its own specific scheduling function
that follows the naming convention ``event_add_<type>``; see :file:`event.h`
for details.
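
Concretely, the scheduling calls for the externally-used task types look like
this. This is a sketch, not copied from FRR source: ``master``, ``handler``,
``arg``, ``fd`` and the reference pointers are placeholders:

::

   /* Each call stores a handle in the final `ref` argument so the task
    * can later be cancelled. All arguments here are placeholders. */
   event_add_read(master, handler, arg, fd, &t_read);   /* THREAD_READ */
   event_add_write(master, handler, arg, fd, &t_write); /* THREAD_WRITE */
   event_add_timer(master, handler, arg, 5, &t_timer);  /* THREAD_TIMER, ~5s */
   event_add_event(master, handler, arg, 0, &t_event);  /* THREAD_EVENT, value 0 */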

There are some gotchas to keep in mind:

- I/O tasks are keyed off the file descriptor associated with the I/O
  operation. This means that for any given file descriptor, only one of each
  type of I/O task (``THREAD_READ`` and ``THREAD_WRITE``) can be scheduled. For
  example, scheduling two write tasks one after the other will overwrite the
  first task with the second, resulting in total loss of the first task and
  difficult bugs.

- Timer tasks are only as accurate as the monotonic clock provided by the
  underlying operating system.

- Memory management of the arbitrary handler argument passed in the schedule
  call is the responsibility of the caller, as the sketch below illustrates.
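
A brief sketch of the first and last points follows. ``struct peer_ctx`` and
its helpers are hypothetical names, and the guard pattern assumes the library
clears the reference pointer when the task executes (as the ``ref`` argument
above implies):

::

   struct peer_ctx {
           int fd;
           struct event *t_write; /* at most one WRITE task per fd */
   };

   static void peer_write(struct event *thread)
   {
           struct peer_ctx *p = thread->arg;
           /* ... flush pending output on p->fd ... */
   }

   static void peer_schedule_write(struct thread_master *m, struct peer_ctx *p)
   {
           /* Guard: a second WRITE task on the same fd would overwrite
            * (and silently lose) the first one. */
           if (p->t_write == NULL)
                   event_add_write(m, peer_write, p, p->fd, &p->t_write);

           /* The event system never frees p; its owner must free it after
            * cancelling any tasks that still reference it. */
   }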


Kernel Thread Architecture
--------------------------
Efforts have begun to introduce kernel threads into FRR to improve performance
and stability. Naturally a kernel thread architecture has long been seen as
orthogonal to an event-driven architecture, and the two do have significant
overlap in terms of design choices. Since the event model is tightly integrated
into FRR, careful thought has been put into how pthreads are introduced, what
role they fill, and how they will interoperate with the event model.

Design Overview
^^^^^^^^^^^^^^^
Each kernel thread behaves as a lightweight process within FRR, sharing the
same process memory space. On the other hand, the event system is designed to
run in a single process and drive serial execution of a set of tasks. With this
consideration, a natural choice is to implement the event system within each
kernel thread. This allows us to leverage the event-driven execution model with
the currently existing task and context primitives. In this way the familiar
execution model of FRR gains the ability to execute tasks simultaneously while
preserving the existing model for concurrency.

The following figure illustrates the architecture with multiple pthreads, each
running their own ``threadmaster``-based event loop.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-multiple.png
   :align: center

   Lifecycle of a program using multiple pthreads, each running their own
   ``threadmaster``

Each roundrect represents a single pthread running the same event loop
described under :ref:`event-architecture`. Note the arrow from the ``exec()``
box on the right to the ``schedule()`` box in the middle pthread. This
illustrates code running in one pthread scheduling a task onto another
pthread's threadmaster. A global lock for each ``threadmaster`` is used to
synchronize these operations. The pthread names are examples.


.. This should be broken into its own document under :ref:`libfrr`
.. _kernel-thread-wrapper:

Kernel Thread Wrapper
^^^^^^^^^^^^^^^^^^^^^
The basis for the integration of pthreads and the event system is a lightweight
wrapper for both systems implemented in :file:`lib/frr_pthread.[ch]`. The
header provides a core datastructure, ``struct frr_pthread``, that encapsulates
structures from both POSIX threads and :file:`event.[ch]`. In particular, this
datastructure has a pointer to a ``threadmaster`` that runs within the pthread.
It also has fields for a name as well as start and stop functions that have
signatures similar to the POSIX arguments for ``pthread_create()``.

Calling ``frr_pthread_new()`` creates and registers a new ``frr_pthread``. The
returned structure has a pre-initialized ``threadmaster``, and its ``start``
and ``stop`` functions are initialized to defaults that will run a basic event
loop with the given threadmaster. Calling ``frr_pthread_run()`` starts the
thread with the ``start`` function. From there, the model is the same as the
regular event model. To schedule tasks on a particular pthread, simply use the
regular :file:`event.c` functions as usual and provide the ``threadmaster``
pointed to from the ``frr_pthread``. As part of implementing the wrapper, the
:file:`event.c` functions were made thread-safe. Consequently, it is safe to
schedule events on a ``threadmaster`` belonging to the calling pthread as well
as to *any other pthread*. This serves as the basis for inter-thread
communication and boils down to a slightly more complicated method of message
passing, where the messages are the regular task events as used in the
event-driven model. The only difference is task cancellation, which requires
calling ``event_cancel_async()`` instead of ``event_cancel()`` to cancel a task
currently scheduled on a ``threadmaster`` belonging to a different pthread.
This is necessary to avoid race conditions in the specific case where one
pthread wants to guarantee that a task on another pthread is cancelled before
proceeding.
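
In code, the whole flow might look like the following sketch. The ``master``
member name for the embedded ``threadmaster`` and the exact argument lists are
assumptions based on the description above, and ``handle_work``, ``work_arg``
and the pthread names are hypothetical:

::

   /* Spawn a pthread running the default event loop (NULL = default
    * attributes; assumed calling convention). */
   struct frr_pthread *io = frr_pthread_new(NULL, "My I/O thread", "myio");
   struct event *t_work = NULL;

   frr_pthread_run(io, NULL);

   /* Inter-thread message passing: schedule a task onto the new pthread's
    * threadmaster; the per-threadmaster lock makes this safe. */
   event_add_event(io->master, handle_work, work_arg, 0, &t_work);

   /* From this pthread, guarantee the task is gone before proceeding; the
    * task lives on another pthread's threadmaster, so the async variant
    * must be used. */
   event_cancel_async(io->master, &t_work, NULL);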

In addition, the existing commands to show statistics and other information for
tasks within the event driven model have been expanded to handle multiple
pthreads; running :clicmd:`show thread cpu` will display the usual event
breakdown, but it will do so for each pthread running in the program. For
example, :ref:`bgpd` runs a dedicated I/O pthread and shows the following
output for :clicmd:`show thread cpu`:

::

   frr# show thread cpu

   Thread statistics for bgpd:

   Showing statistics for pthread main
   ------------------------------------
                             CPU (user+system):  Real (wall-clock):
   Active   Runtime(ms)   Invoked  Avg uSec  Max uSecs  Avg uSec  Max uSecs  Type  Thread
        0      1389.000        10    138900     248000    135549     255349    T   subgroup_coalesce_timer
        0         0.000         1         0          0        18         18    T   bgp_startup_timer_expire
        0       850.000        18     47222     222000     47795     233814    T   work_queue_run
        0         0.000        10         0          0         6         14    T   update_subgroup_merge_check_thread_cb
        0         0.000         8         0          0       117        160    W   zclient_flush_data
        2         2.000         1      2000       2000       831        831    R   bgp_accept
        0         1.000         1      1000       1000      2832       2832    E   zclient_connect
        1     42082.000    240574       174      37000       178      72810    R   vtysh_read
        1       152.000      1885        80       2000        96       6292    R   zclient_read
        0    549346.000   2997298       183       7000       153      20242    E   bgp_event
        0      2120.000       300      7066      14000      6813      22046    T   (bgp_holdtime_timer)
        0         0.000         2         0          0        57         59    T   update_group_refresh_default_originate_route_map
        0        90.000         1     90000      90000     73729      73729    T   bgp_route_map_update_timer
        0      1417.000      9147       154      48000       132      61998    T   bgp_process_packet
      300     71807.000   2995200        23       3000        24      11066    T   (bgp_connect_timer)
        0      1894.000     12713       148      45000       112      33606    T   (bgp_generate_updgrp_packets)
        0         0.000         1         0          0       105        105    W   vtysh_write
        0        52.000       599        86       2000       138       6992    T   (bgp_start_timer)
        1         1.000         8       125       1000       164        593    R   vtysh_accept
        0        15.000       600        25       2000        15        153    T   (bgp_routeadv_timer)
        0        11.000       299        36       3000        53       3128   RW   bgp_connect_check


   Showing statistics for pthread BGP I/O thread
   ----------------------------------------------
                             CPU (user+system):  Real (wall-clock):
   Active   Runtime(ms)   Invoked  Avg uSec  Max uSecs  Avg uSec  Max uSecs  Type  Thread
        0      1611.000      9296       173      13000       188      13685    R   bgp_process_reads
        0      2995.000     11753       254      26000       182      29355    W   bgp_process_writes


   Showing statistics for pthread BGP Keepalives thread
   -----------------------------------------------------
                             CPU (user+system):  Real (wall-clock):
   Active   Runtime(ms)   Invoked  Avg uSec  Max uSecs  Avg uSec  Max uSecs  Type  Thread
   No data to display yet.

Attentive readers will notice that there is a third thread, the Keepalives
thread. This thread is responsible for -- surprise -- generating keepalives for
peers. However, no statistics are shown for that thread. Although the
pthread uses the ``frr_pthread`` wrapper, it opts not to use the embedded
``threadmaster`` facilities. Instead it replaces the ``start`` and ``stop``
functions with custom functions. This was done because the ``threadmaster``
facilities introduce a small but significant amount of overhead relative to the
pthread's task. In this case since the pthread does not need the event-driven
model and does not need to receive tasks from other pthreads, it is simpler and
more efficient to implement it outside of the provided event facilities. The
point to take away from this example is that while the facilities to make using
pthreads within FRR easy are already implemented, the wrapper is flexible and
allows usage of other models while still integrating with the rest of the FRR
core infrastructure. Starting and stopping this pthread works the same as it
does for any other ``frr_pthread``; the only difference is that event
statistics are not collected for it, because there are no events.
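
For contrast, a pthread that opts out of the embedded event loop supplies
custom functions through the attribute structure, roughly as sketched below.
The loop body and the ``ka_run`` flag are illustrative, and the field names
and signatures are assumptions based on the description above rather than
verified declarations from :file:`lib/frr_pthread.h`:

::

   #include <pthread.h>
   #include <stdatomic.h>

   static atomic_bool ka_run = true;

   /* Custom start: an ordinary pthread body; no threadmaster involved. */
   static void *keepalives_start(void *arg)
   {
           while (atomic_load(&ka_run)) {
                   /* ... sleep until the next keepalive is due, send it ... */
           }
           return NULL;
   }

   /* Custom stop: signal the loop to exit and join the pthread. */
   static int keepalives_stop(struct frr_pthread *fpt, void **result)
   {
           atomic_store(&ka_run, false);
           return pthread_join(fpt->thread, result);
   }

   /* Registration is the same as for any other frr_pthread; only the
    * start/stop functions differ. */
   struct frr_pthread_attr attr = {
           .start = keepalives_start,
           .stop = keepalives_stop,
   };
   struct frr_pthread *ka =
           frr_pthread_new(&attr, "BGP Keepalives thread", "bgpd_ka");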

Notes on Design and Documentation
---------------------------------
Because of the choice to embed the existing event system into each pthread
within FRR, at this time there is no integrated support for other models of
pthread use such as divide and conquer. Similarly, there is no explicit support
for thread pooling or similar higher level constructs. The currently existing
infrastructure is designed around the concept of long-running worker threads
responsible for specific jobs within each daemon. This is not to say that
divide and conquer, thread pooling, etc. could not be implemented in the
future. However, designs in this direction must be very careful to take into
account the existing codebase. Introducing kernel threads into programs that
have been written under the assumption of a single thread of execution must be
done very carefully to avoid insidious errors and to ensure the program remains
understandable and maintainable.

In keeping with these goals, future work on kernel threading should be
extensively documented here and FRR developers should be very careful with
their design choices, as poor choices tightly integrated can prove to be
catastrophic for development efforts in the future.