.. _process-architecture:

Process Architecture
====================

FRR inherited its overall design architecture from Quagga. The chosen model for
Quagga is that of a suite of independent daemons that do IPC via Unix domain
sockets. Within each daemon, the architecture follows the event-driven model.
FRR has inherited this model as well. As FRR is deployed at larger scales and
gains ever more features, each adding to the overall processing workload, we
are approaching the saturation point for a single thread per daemon. In light
of this, there are ongoing efforts to introduce multithreading to various
components of FRR. This document aims to describe the current design choices
and overall model for integrating the event-driven and multithreaded
architectures into a cohesive whole.

Terminology
-----------
Because this document describes the architecture for true kernel threads as
well as the event system, a digression on terminology is in order here.

Historically Quagga's event system was viewed as an implementation of userspace
threading. Because of this design choice, the names for various datastructures
within the event system are variations on the term "thread". The primary
context datastructure in this system is called a "threadmaster". What would
today be called an 'event' or 'task' in systems such as libevent are called
"threads" and the datastructure for them is ``struct thread``. To add to the
confusion, these "threads" have various types, one of which is "event". To
hopefully avoid some of this confusion, this document refers to these "threads"
as 'tasks' except where the datastructures are explicitly named. When they are
explicitly named, they will be formatted ``like this`` to differentiate from
the conceptual names. When speaking of kernel threads, the term used will be
"pthread" since FRR's kernel threading implementation is POSIX threads.

.. This should be broken into its own document under :ref:`libfrr`
.. _event-architecture:

Event Architecture
------------------
This section presents a brief overview of the event model as currently
implemented in FRR. This doc should be expanded and broken off into its own
section. For now it provides basic information necessary to understand the
interplay between the event system and kernel threads.

The core event system is implemented in :file:`lib/thread.[ch]`. The primary
structure is ``struct thread_master``, hereafter referred to as a
``threadmaster``. A ``threadmaster`` is a global state object, or context, that
holds all the tasks currently pending execution as well as statistics on tasks
that have already executed. The event system is driven by adding tasks to this
data structure and then calling a function to retrieve the next task to
execute. At initialization, a daemon will typically create one
``threadmaster``, add a small set of initial tasks, and then run a loop to
fetch each task and execute it.

These tasks have various types corresponding to their general action. The types
are given by integer macros in :file:`thread.h` and are:

``THREAD_READ``
   Task which waits for a file descriptor to become ready for reading and then
   executes.

``THREAD_WRITE``
   Task which waits for a file descriptor to become ready for writing and then
   executes.

``THREAD_TIMER``
   Task which executes after a certain amount of time has passed since it was
   scheduled.

``THREAD_EVENT``
   Generic task that executes with high priority and carries an arbitrary
   integer indicating the event type to its handler. These are commonly used to
   implement the finite state machines typically found in routing protocols.

``THREAD_READY``
   Type used internally for tasks on the ready queue.

``THREAD_UNUSED``
   Type used internally for ``struct thread`` objects that aren't being used.
   The event system pools ``struct thread`` to avoid heap allocations; this is
   the type they have when they're in the pool.

``THREAD_EXECUTE``
   Just before a task is run, its type is changed to this. This is used to show
   ``X`` as the type in the output of :clicmd:`show thread cpu`.

The programmer never has to work with these types explicitly. Each type of task
is created and queued via special-purpose functions (actually macros, but
irrelevant for the time being) for the specific type. For example, to add a
``THREAD_READ`` task, you would call

::

   thread_add_read(struct thread_master *master,
                   int (*handler)(struct thread *), void *arg, int fd,
                   struct thread **ref);

The ``struct thread`` is then created and added to the appropriate internal
datastructure within the ``threadmaster``.

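As a concrete sketch, here is roughly how a daemon might schedule a read task
on a socket. The names ``master``, ``sockfd``, ``my_ctx`` and ``handle_read``
are hypothetical; the ``THREAD_ARG`` and ``THREAD_FD`` accessor macros are
provided by :file:`thread.h`.

::

   /* Hypothetical read handler. Handlers receive the task that fired and
    * can retrieve their argument and file descriptor from it. */
   static int handle_read(struct thread *thread)
   {
           void *ctx = THREAD_ARG(thread);
           int fd = THREAD_FD(thread);

           /* ... read from fd and process the data ... */

           /* Tasks fire exactly once; to keep reading, the handler must
            * reschedule itself. */
           thread_add_read(master, handle_read, ctx, fd, NULL);
           return 0;
   }

   /* Initial scheduling, e.g. from daemon startup code */
   thread_add_read(master, handle_read, my_ctx, sockfd, NULL);
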
The Event Loop
^^^^^^^^^^^^^^
To use the event system, after creating a ``threadmaster`` the program adds an
initial set of tasks. As these tasks execute, they add more tasks that execute
at some point in the future. This sequence of tasks drives the lifecycle of the
program. When no more tasks are available, the program dies. Typically at
startup the first task added is an I/O task for VTYSH as well as any network
sockets needed for peerings or IPC.

To retrieve the next task to run, the program calls ``thread_fetch()``.
``thread_fetch()`` internally computes which task to execute next based on
rudimentary priority logic. Events (type ``THREAD_EVENT``) execute with the
highest priority, followed by expired timers and finally I/O tasks (types
``THREAD_READ`` and ``THREAD_WRITE``). When scheduling a task, a function and
an arbitrary argument are provided. The task returned from ``thread_fetch()``
is then executed with ``thread_call()``.

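Putting these pieces together, the skeleton of a daemon's lifecycle looks
roughly like the following sketch. (``thread_master_create()`` is the
constructor in :file:`thread.c`; whether it takes a name argument depends on
the FRR version, so treat the exact call as illustrative.)

::

   struct thread_master *master = thread_master_create("example");
   struct thread thread;

   /* add initial tasks here, e.g. thread_add_read() on listening sockets */

   /* fetch-execute loop; thread_fetch() blocks until a task is ready */
   while (thread_fetch(master, &thread))
           thread_call(&thread);
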
The following diagram illustrates a simplified version of this infrastructure.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-single.png
   :align: center

   Lifecycle of a program using a single threadmaster.

The series of "task" boxes represents the current ready task queue. The various
other queues for other types are not shown. The fetch-execute loop is
illustrated at the bottom.

Mapping the general names used in the figure to specific FRR functions:

- ``task`` is ``struct thread *``
- ``fetch`` is ``thread_fetch()``
- ``exec()`` is ``thread_call()``
- ``cancel()`` is ``thread_cancel()``
- ``schedule()`` is any of the various task-specific ``thread_add_*`` functions

Adding tasks is done with various task-specific function-like macros. These
macros wrap underlying functions in :file:`thread.c` to provide additional
information added at compile time, such as the line number the task was
scheduled from, that can be accessed at runtime for debugging, logging and
informational purposes. Each task type has its own specific scheduling function
that follows the naming convention ``thread_add_<type>``; see :file:`thread.h`
for details.

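For example, each of the following calls schedules a task of the corresponding
type. The handler names, context argument and file descriptor are
hypothetical; the argument order follows the pattern in :file:`thread.h`.

::

   /* run hello_timer() after 5 seconds */
   thread_add_timer(master, hello_timer, ctx, 5, NULL);

   /* run handle_event() at high priority, passing event code 0 */
   thread_add_event(master, handle_event, ctx, 0, NULL);

   /* run flush_buffer() when sockfd becomes writeable */
   thread_add_write(master, flush_buffer, ctx, sockfd, NULL);
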
There are some gotchas to keep in mind:

- I/O tasks are keyed off the file descriptor associated with the I/O
  operation. This means that for any given file descriptor, only one of each
  type of I/O task (``THREAD_READ`` and ``THREAD_WRITE``) can be scheduled. For
  example, scheduling two write tasks one after the other will overwrite the
  first task with the second, resulting in total loss of the first task and
  difficult bugs; see the sketch after this list.

- Timer tasks are only as accurate as the monotonic clock provided by the
  underlying operating system.

- Memory management of the arbitrary handler argument passed in the schedule
  call is the responsibility of the caller.

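A minimal sketch of the first gotcha, with hypothetical handler names:

::

   /* Both write tasks are keyed on sockfd. The second call overwrites the
    * first, so flush_a() is silently lost and will never run. */
   thread_add_write(master, flush_a, ctx, sockfd, NULL);
   thread_add_write(master, flush_b, ctx, sockfd, NULL);
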
Kernel Thread Architecture
--------------------------
Efforts have begun to introduce kernel threads into FRR to improve performance
and stability. Naturally a kernel thread architecture has long been seen as
orthogonal to an event-driven architecture, but the two do have significant
overlap in terms of design choices. Since the event model is tightly integrated
into FRR, careful thought has been put into how pthreads are introduced, what
role they fill, and how they will interoperate with the event model.

Design Overview
^^^^^^^^^^^^^^^
Each kernel thread behaves as a lightweight process within FRR, sharing the
same process memory space. On the other hand, the event system is designed to
run within a single thread of execution and drive serial execution of a set of
tasks. With this consideration, a natural choice is to implement the event
system within each kernel thread. This allows us to leverage the event-driven
execution model with the currently existing task and context primitives. In
this way the familiar execution model of FRR gains the ability to execute
tasks simultaneously while preserving the existing model for concurrency.

The following figure illustrates the architecture with multiple pthreads, each
running its own ``threadmaster``-based event loop.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-multiple.png
   :align: center

   Lifecycle of a program using multiple pthreads, each running its own
   ``threadmaster``

Each roundrect represents a single pthread running the same event loop
described under :ref:`event-architecture`. Note the arrow from the ``exec()``
box on the right to the ``schedule()`` box in the middle pthread. This
illustrates code running in one pthread scheduling a task onto another
pthread's threadmaster. Each ``threadmaster`` has its own lock, which is used
to synchronize these cross-thread operations. The pthread names are examples.

.. This should be broken into its own document under :ref:`libfrr`
.. _kernel-thread-wrapper:

Kernel Thread Wrapper
^^^^^^^^^^^^^^^^^^^^^
The basis for the integration of pthreads and the event system is a lightweight
wrapper for both systems implemented in :file:`lib/frr_pthread.[ch]`. The
header provides a core datastructure, ``struct frr_pthread``, that encapsulates
structures from both POSIX threads and :file:`thread.[ch]`. In particular, this
datastructure has a pointer to a ``threadmaster`` that runs within the pthread.
It also has fields for a name as well as start and stop functions that have
signatures similar to the POSIX arguments for ``pthread_create()``.

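A simplified sketch of the wrapper's shape, with illustrative field names (see
:file:`lib/frr_pthread.h` for the real definition, which differs in detail):

::

   /* Illustrative only; not the actual FRR definition. */
   struct frr_pthread {
           pthread_t thread;             /* underlying POSIX thread */
           const char *name;
           struct thread_master *master; /* event loop run inside the pthread */

           /* start/stop functions, pthread_create()-style */
           void *(*start)(void *arg);
           int (*stop)(struct frr_pthread *fpt, void **result);
   };
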
Calling ``frr_pthread_new()`` creates and registers a new ``frr_pthread``. The
returned structure has a pre-initialized ``threadmaster``, and its ``start``
and ``stop`` functions are initialized to defaults that will run a basic event
loop with the given threadmaster. Calling ``frr_pthread_run()`` starts the
thread with the ``start`` function. From there, the model is the same as the
regular event model. To schedule tasks on a particular pthread, simply use the
regular :file:`thread.c` functions as usual and provide the ``threadmaster``
pointed to from the ``frr_pthread``. As part of implementing the wrapper, the
:file:`thread.c` functions were made thread-safe. Consequently, it is safe to
schedule events on a ``threadmaster`` belonging either to the calling pthread
or to *any other pthread*. This serves as the basis for inter-thread
communication and boils down to a slightly more complicated method of message
passing, where the messages are the regular task events as used in the
event-driven model. The only difference is task cancellation, which requires
calling ``thread_cancel_async()`` instead of ``thread_cancel()`` to cancel a
task currently scheduled on a ``threadmaster`` belonging to a different
pthread. This is necessary to avoid race conditions in the specific case where
one pthread wants to guarantee that a task on another pthread is cancelled
before proceeding.

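A rough usage sketch follows. The exact signatures of ``frr_pthread_new()``
and ``frr_pthread_run()`` vary between FRR versions, so treat the argument
lists as illustrative; the ``master`` field name follows the sketch above, and
the handler and context names are hypothetical.

::

   /* Create the wrapper; it comes with its own threadmaster and default
    * start/stop functions that run a basic event loop. */
   struct frr_pthread *fpt = frr_pthread_new(NULL, "example", "example");

   /* Start the pthread; the default start function runs the event loop. */
   frr_pthread_run(fpt, NULL);

   /* Inter-thread communication: schedule a task onto the new pthread's
    * threadmaster from any other pthread. */
   struct thread *ref = NULL;
   thread_add_event(fpt->master, handle_event, ctx, 0, &ref);

   /* Cancelling a task on another pthread's threadmaster must use the
    * asynchronous variant. */
   thread_cancel_async(fpt->master, &ref, NULL);
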
In addition, the existing commands to show statistics and other information for
tasks within the event-driven model have been expanded to handle multiple
pthreads; running :clicmd:`show thread cpu` will display the usual event
breakdown, but it will do so for each pthread running in the program. For
example, :ref:`bgpd` runs a dedicated I/O pthread and shows the following
output for :clicmd:`show thread cpu`:

::

   frr# show thread cpu

   Thread statistics for bgpd:

   Showing statistics for pthread main
   ------------------------------------
                        CPU (user+system):  Real (wall-clock):
   Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  Type  Thread
       0      1389.000        10   138900   248000   135549   255349    T     subgroup_coalesce_timer
       0         0.000         1        0        0       18       18    T     bgp_startup_timer_expire
       0       850.000        18    47222   222000    47795   233814    T     work_queue_run
       0         0.000        10        0        0        6       14    T     update_subgroup_merge_check_thread_cb
       0         0.000         8        0        0      117      160    W     zclient_flush_data
       2         2.000         1     2000     2000      831      831    R     bgp_accept
       0         1.000         1     1000     1000     2832     2832    E     zclient_connect
       1     42082.000    240574      174    37000      178    72810    R     vtysh_read
       1       152.000      1885       80     2000       96     6292    R     zclient_read
       0    549346.000   2997298      183     7000      153    20242    E     bgp_event
       0      2120.000       300     7066    14000     6813    22046    T     (bgp_holdtime_timer)
       0         0.000         2        0        0       57       59    T     update_group_refresh_default_originate_route_map
       0        90.000         1    90000    90000    73729    73729    T     bgp_route_map_update_timer
       0      1417.000      9147      154    48000      132    61998    T     bgp_process_packet
     300     71807.000   2995200       23     3000       24    11066    T     (bgp_connect_timer)
       0      1894.000     12713      148    45000      112    33606    T     (bgp_generate_updgrp_packets)
       0         0.000         1        0        0      105      105    W     vtysh_write
       0        52.000       599       86     2000      138     6992    T     (bgp_start_timer)
       1         1.000         8      125     1000      164      593    R     vtysh_accept
       0        15.000       600       25     2000       15      153    T     (bgp_routeadv_timer)
       0        11.000       299       36     3000       53     3128    RW    bgp_connect_check


   Showing statistics for pthread BGP I/O thread
   ----------------------------------------------
                        CPU (user+system):  Real (wall-clock):
   Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  Type  Thread
       0      1611.000      9296      173    13000      188    13685    R     bgp_process_reads
       0      2995.000     11753      254    26000      182    29355    W     bgp_process_writes


   Showing statistics for pthread BGP Keepalives thread
   -----------------------------------------------------
                        CPU (user+system):  Real (wall-clock):
   Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  Type  Thread
   No data to display yet.

Attentive readers will notice that there is a third pthread, the Keepalives
thread. This pthread is responsible for -- surprise -- generating keepalives
for peers. However, no statistics are shown for it. Although the pthread uses
the ``frr_pthread`` wrapper, it opts not to use the embedded ``threadmaster``
facilities. Instead it replaces the ``start`` and ``stop`` functions with
custom functions. This was done because the ``threadmaster`` facilities
introduce a small but significant amount of overhead relative to the pthread's
task. In this case, since the pthread does not need the event-driven model and
does not need to receive tasks from other pthreads, it is simpler and more
efficient to implement it outside of the provided event facilities. The point
to take away from this example is that while the facilities to make using
pthreads within FRR easy are already implemented, the wrapper is flexible and
allows usage of other models while still integrating with the rest of the FRR
core infrastructure. Starting and stopping this pthread works the same as it
does for any other ``frr_pthread``; the only difference is that event
statistics are not collected for it, because there are no events.

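Swapping in custom behavior is a matter of providing your own ``start`` and
``stop`` functions. The sketch below assumes they are supplied through an
attribute structure passed to ``frr_pthread_new()``; the exact mechanism and
all names here are illustrative, see :file:`lib/frr_pthread.[ch]` for the real
interface.

::

   /* Custom entry point with a pthread_create()-style shape; runs its own
    * loop instead of a threadmaster-based event loop. */
   static void *keepalives_start(void *arg)
   {
           struct frr_pthread *fpt = arg;

           while (keepalives_should_run(fpt)) /* hypothetical predicate */
                   send_keepalives();         /* hypothetical worker */

           return NULL;
   }

   /* Matching custom stop function (hypothetical) */
   static int keepalives_stop(struct frr_pthread *fpt, void **result);

   struct frr_pthread_attr attr = {
           .start = keepalives_start,
           .stop = keepalives_stop,
   };
   struct frr_pthread *fpt =
           frr_pthread_new(&attr, "keepalives", "keepalives");
   frr_pthread_run(fpt, NULL);
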
Notes on Design and Documentation
---------------------------------
Because of the choice to embed the existing event system into each pthread
within FRR, at this time there is no integrated support for other models of
pthread use, such as divide and conquer. Similarly, there is no explicit
support for thread pooling or similar higher level constructs. The currently
existing infrastructure is designed around the concept of long-running worker
threads responsible for specific jobs within each daemon. This is not to say
that divide and conquer, thread pooling, etc. could not be implemented in the
future. However, designs in this direction must be very careful to take into
account the existing codebase. Introducing kernel threads into programs that
have been written under the assumption of a single thread of execution must be
done very carefully to avoid insidious errors and to ensure the program remains
understandable and maintainable.

In keeping with these goals, future work on kernel threading should be
extensively documented here, and FRR developers should be very careful with
their design choices, as poor choices that become tightly integrated can prove
catastrophic for future development efforts.