.. _process-architecture:

Process Architecture
====================

FRR inherited its overall design architecture from Quagga. The chosen model for
Quagga is that of a suite of independent daemons that do IPC via Unix domain
sockets. Within each daemon, the architecture follows the event-driven model.
FRR has inherited this model as well. As FRR is deployed at larger scales and
gains ever more features, each adding to the overall processing workload, we
are approaching the saturation point for a single thread per daemon. In light
of this, there are ongoing efforts to introduce multithreading to various
components of FRR. This document describes the current design choices and
overall model for integrating the event-driven and multithreaded
architectures into a cohesive whole.

Terminology
-----------
Because this document describes the architecture for true kernel threads as
well as the event system, a digression on terminology is in order here.

Historically Quagga's event system was viewed as an implementation of userspace
threading. Because of this design choice, the names for various data structures
within the event system are variations on the term "thread". The primary
context data structure in this system is called a "threadmaster". What would
today be called an 'event' or 'task' in systems such as libevent are called
"threads", and the data structure for them is ``struct thread``. To add to the
confusion, these "threads" have various types, one of which is "event". To
avoid some of this confusion, this document refers to these "threads" as
'tasks', except where the data structures are explicitly named. When they are
explicitly named, they will be formatted ``like this`` to differentiate from
the conceptual names. When speaking of kernel threads, the term used will be
"pthread", since FRR's kernel threading implementation uses POSIX threads.

.. This should be broken into its own document under :ref:`libfrr`
.. _event-architecture:

Event Architecture
------------------
This section presents a brief overview of the event model as currently
implemented in FRR. This document should be expanded and broken off into its
own section. For now it provides the basic information necessary to understand
the interplay between the event system and kernel threads.

The core event system is implemented in :file:`lib/thread.[ch]`. The primary
structure is ``struct thread_master``, hereafter referred to as a
``threadmaster``. A ``threadmaster`` is a global state object, or context, that
holds all the tasks currently pending execution as well as statistics on tasks
that have already executed. The event system is driven by adding tasks to this
data structure and then calling a function to retrieve the next task to
execute. At initialization, a daemon will typically create one
``threadmaster``, add a small set of initial tasks, and then run a loop to
fetch each task and execute it.

These tasks have various types corresponding to their general action. The types
are given by integer macros in :file:`thread.h` and are:

``THREAD_READ``
   Task which waits for a file descriptor to become ready for reading and then
   executes.

``THREAD_WRITE``
   Task which waits for a file descriptor to become ready for writing and then
   executes.

``THREAD_TIMER``
   Task which executes after a certain amount of time has passed since it was
   scheduled.

``THREAD_EVENT``
   Generic task that executes with high priority and carries an arbitrary
   integer indicating the event type to its handler. These are commonly used to
   implement the finite state machines typically found in routing protocols.

``THREAD_READY``
   Type used internally for tasks on the ready queue.

``THREAD_UNUSED``
   Type used internally for ``struct thread`` objects that aren't being used.
   The event system pools ``struct thread`` objects to avoid heap allocations;
   this is the type they have when they're in the pool.

``THREAD_EXECUTE``
   Just before a task is run, its type is changed to this. This is used to show
   ``X`` as the type in the output of :clicmd:`show thread cpu`.

The programmer never has to work with these types explicitly. Each type of task
is created and queued via special-purpose functions (actually macros, but
irrelevant for the time being) for the specific type. For example, to add a
``THREAD_READ`` task, you would call:

::

   thread_add_read(struct thread_master *master,
                   int (*handler)(struct thread *), void *arg, int fd,
                   struct thread **ref);

The ``struct thread`` is then created and added to the appropriate internal
data structure within the ``threadmaster``.

The Event Loop
^^^^^^^^^^^^^^
To use the event system, after creating a ``threadmaster`` the program adds an
initial set of tasks. As these tasks execute, they add more tasks that execute
at some point in the future. This sequence of tasks drives the lifecycle of the
program. When no more tasks are available, the program dies. Typically at
startup the first task added is an I/O task for VTYSH as well as any network
sockets needed for peerings or IPC.

To retrieve the next task to run, the program calls ``thread_fetch()``.
``thread_fetch()`` internally computes which task to execute next based on
rudimentary priority logic. Events (type ``THREAD_EVENT``) execute with the
highest priority, followed by expired timers and finally I/O tasks (types
``THREAD_READ`` and ``THREAD_WRITE``). When scheduling a task, a function and
an arbitrary argument are provided. The task returned from ``thread_fetch()``
is then executed with ``thread_call()``.

The following diagram illustrates a simplified version of this infrastructure.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-single.png
   :align: center

   Lifecycle of a program using a single threadmaster.

The series of "task" boxes represents the current ready task queue. The various
other queues for other types are not shown. The fetch-execute loop is
illustrated at the bottom.

Mapping the general names used in the figure to specific FRR functions:

- ``task`` is ``struct thread *``
- ``fetch`` is ``thread_fetch()``
- ``exec()`` is ``thread_call()``
- ``cancel()`` is ``thread_cancel()``
- ``schedule()`` is any of the various task-specific ``thread_add_*`` functions

Adding tasks is done with various task-specific function-like macros. These
macros wrap underlying functions in :file:`thread.c` to provide additional
information added at compile time, such as the line number the task was
scheduled from, that can be accessed at runtime for debugging, logging and
informational purposes. Each task type has its own specific scheduling function
that follows the naming convention ``thread_add_<type>``; see :file:`thread.h`
for details.

There are some gotchas to keep in mind:

- I/O tasks are keyed off the file descriptor associated with the I/O
  operation. This means that for any given file descriptor, only one of each
  type of I/O task (``THREAD_READ`` and ``THREAD_WRITE``) can be scheduled. For
  example, scheduling two write tasks one after the other will overwrite the
  first task with the second, resulting in the total loss of the first task
  and difficult-to-diagnose bugs.

- Timer tasks are only as accurate as the monotonic clock provided by the
  underlying operating system.

- Memory management of the arbitrary handler argument passed in the schedule
  call is the responsibility of the caller.

Kernel Thread Architecture
--------------------------
Efforts have begun to introduce kernel threads into FRR to improve performance
and stability. A kernel thread architecture has long been seen as orthogonal
to an event-driven architecture, yet the two have significant overlap in terms
of design choices. Since the event model is tightly integrated into FRR,
careful thought has been put into how pthreads are introduced, what role they
fill, and how they will interoperate with the event model.

Design Overview
^^^^^^^^^^^^^^^
Each kernel thread behaves as a lightweight process within FRR, sharing the
same process memory space. On the other hand, the event system is designed to
run in a single process and drive serial execution of a set of tasks. With this
consideration, a natural choice is to implement the event system within each
kernel thread. This allows us to leverage the event-driven execution model with
the currently existing task and context primitives. In this way the familiar
execution model of FRR gains the ability to execute tasks simultaneously while
preserving the existing model for concurrency.

The following figure illustrates the architecture with multiple pthreads, each
running their own ``threadmaster``-based event loop.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-multiple.png
   :align: center

   Lifecycle of a program using multiple pthreads, each running their own
   ``threadmaster``.

Each roundrect represents a single pthread running the same event loop
described under :ref:`event-architecture`. Note the arrow from the ``exec()``
box on the right to the ``schedule()`` box in the middle pthread. This
illustrates code running in one pthread scheduling a task onto another
pthread's threadmaster. A global lock for each ``threadmaster`` is used to
synchronize these operations. The pthread names are examples.


.. This should be broken into its own document under :ref:`libfrr`
.. _kernel-thread-wrapper:

Kernel Thread Wrapper
^^^^^^^^^^^^^^^^^^^^^
The basis for the integration of pthreads and the event system is a lightweight
wrapper for both systems implemented in :file:`lib/frr_pthread.[ch]`. The
header provides a core data structure, ``struct frr_pthread``, that
encapsulates structures from both POSIX threads and :file:`thread.[ch]`. In
particular, this data structure has a pointer to a ``threadmaster`` that runs
within the pthread. It also has fields for a name as well as start and stop
functions that have signatures similar to the POSIX arguments for
``pthread_create()``.

Calling ``frr_pthread_new()`` creates and registers a new ``frr_pthread``. The
returned structure has a pre-initialized ``threadmaster``, and its ``start``
and ``stop`` functions are initialized to defaults that will run a basic event
loop with the given ``threadmaster``. Calling ``frr_pthread_run()`` starts the
thread with the ``start`` function. From there, the model is the same as the
regular event model. To schedule tasks on a particular pthread, simply use the
regular :file:`thread.c` functions as usual and provide the ``threadmaster``
pointed to from the ``frr_pthread``. As part of implementing the wrapper, the
:file:`thread.c` functions were made thread-safe. Consequently, it is safe to
schedule events on a ``threadmaster`` belonging both to the calling pthread as
well as *any other pthread*. This serves as the basis for inter-thread
communication and boils down to a slightly more complicated method of message
passing, where the messages are the regular task events as used in the
event-driven model. The only difference is in thread cancellation, which
requires calling ``thread_cancel_async()`` instead of ``thread_cancel()`` to
cancel a task currently scheduled on a ``threadmaster`` belonging to a
different pthread. This is necessary to avoid race conditions in the specific
case where one pthread wants to guarantee that a task on another pthread is
cancelled before proceeding.

In addition, the existing commands to show statistics and other information for
tasks within the event-driven model have been expanded to handle multiple
pthreads; running :clicmd:`show thread cpu` will display the usual event
breakdown, but it will do so for each pthread running in the program. For
example, :ref:`bgpd` runs a dedicated I/O pthread and shows the following
output for :clicmd:`show thread cpu`:

::

   frr# show thread cpu

   Thread statistics for bgpd:

   Showing statistics for pthread main
   ------------------------------------
                                 CPU (user+system):   Real (wall-clock):
   Active   Runtime(ms)    Invoked  Avg uSec  Max uSecs  Avg uSec  Max uSecs  Type  Thread
        0      1389.000         10    138900     248000    135549     255349   T    subgroup_coalesce_timer
        0         0.000          1         0          0        18         18   T    bgp_startup_timer_expire
        0       850.000         18     47222     222000     47795     233814   T    work_queue_run
        0         0.000         10         0          0         6         14   T    update_subgroup_merge_check_thread_cb
        0         0.000          8         0          0       117        160   W    zclient_flush_data
        2         2.000          1      2000       2000       831        831   R    bgp_accept
        0         1.000          1      1000       1000      2832       2832   E    zclient_connect
        1     42082.000     240574       174      37000       178      72810   R    vtysh_read
        1       152.000       1885        80       2000        96       6292   R    zclient_read
        0    549346.000    2997298       183       7000       153      20242   E    bgp_event
        0      2120.000        300      7066      14000      6813      22046   T    (bgp_holdtime_timer)
        0         0.000          2         0          0        57         59   T    update_group_refresh_default_originate_route_map
        0        90.000          1     90000      90000     73729      73729   T    bgp_route_map_update_timer
        0      1417.000       9147       154      48000       132      61998   T    bgp_process_packet
      300     71807.000    2995200        23       3000        24      11066   T    (bgp_connect_timer)
        0      1894.000      12713       148      45000       112      33606   T    (bgp_generate_updgrp_packets)
        0         0.000          1         0          0       105        105   W    vtysh_write
        0        52.000        599        86       2000       138       6992   T    (bgp_start_timer)
        1         1.000          8       125       1000       164        593   R    vtysh_accept
        0        15.000        600        25       2000        15        153   T    (bgp_routeadv_timer)
        0        11.000        299        36       3000        53       3128   RW   bgp_connect_check


   Showing statistics for pthread BGP I/O thread
   ----------------------------------------------
                                 CPU (user+system):   Real (wall-clock):
   Active   Runtime(ms)    Invoked  Avg uSec  Max uSecs  Avg uSec  Max uSecs  Type  Thread
        0      1611.000       9296       173      13000       188      13685   R    bgp_process_reads
        0      2995.000      11753       254      26000       182      29355   W    bgp_process_writes


   Showing statistics for pthread BGP Keepalives thread
   -----------------------------------------------------
                                 CPU (user+system):   Real (wall-clock):
   Active   Runtime(ms)    Invoked  Avg uSec  Max uSecs  Avg uSec  Max uSecs  Type  Thread
   No data to display yet.

Attentive readers will notice that there is a third pthread, the Keepalives
thread. This thread is responsible for -- surprise -- generating keepalives for
peers. However, there are no statistics showing for that thread. Although the
pthread uses the ``frr_pthread`` wrapper, it opts not to use the embedded
``threadmaster`` facilities. Instead it replaces the ``start`` and ``stop``
functions with custom functions. This was done because the ``threadmaster``
facilities introduce a small but significant amount of overhead relative to the
pthread's task. In this case, since the pthread does not need the event-driven
model and does not need to receive tasks from other pthreads, it is simpler and
more efficient to implement it outside of the provided event facilities. The
point to take away from this example is that while the facilities to make using
pthreads within FRR easy are already implemented, the wrapper is flexible and
allows usage of other models while still integrating with the rest of the FRR
core infrastructure. Starting and stopping this pthread works the same as it
does for any other ``frr_pthread``; the only difference is that event
statistics are not collected for it, because there are no events.

Notes on Design and Documentation
---------------------------------
Because of the choice to embed the existing event system into each pthread
within FRR, at this time there is no integrated support for other models of
pthread use, such as divide and conquer. Similarly, there is no explicit
support for thread pooling or similar higher-level constructs. The currently
existing infrastructure is designed around the concept of long-running worker
threads responsible for specific jobs within each daemon. This is not to say
that divide and conquer, thread pooling, etc. could not be implemented in the
future. However, designs in this direction must be very careful to take into
account the existing codebase. Introducing kernel threads into programs that
have been written under the assumption of a single thread of execution must be
done very carefully to avoid insidious errors and to ensure the program
remains understandable and maintainable.

In keeping with these goals, future work on kernel threading should be
extensively documented here, and FRR developers should be very careful with
their design choices, as poor choices tightly integrated can prove to be
catastrophic for development efforts in the future.