.. _process-architecture:

Process Architecture
====================

FRR is a suite of daemons that serve different functions. This document
describes the internal architecture of these daemons, focusing on their general
design patterns and especially on how threads are used in the daemons that use
them.

Overview
--------
The fundamental pattern used in FRR daemons is an `event loop
<https://en.wikipedia.org/wiki/Event_loop>`_. Some daemons use `kernel threads
<https://en.wikipedia.org/wiki/Thread_(computing)#Kernel_threads>`_. In these
daemons, each kernel thread runs its own event loop. The event loop
implementation is constructed to be thread safe and to allow threads other than
its owning thread to schedule events on it. The rest of this document describes
these two designs in detail.

Terminology
-----------
Because this document describes the architecture for kernel threads as well as
the event system, a digression on terminology is in order here.

Historically, Quagga's loop system was viewed as an implementation of userspace
threading. Because of this design choice, the names for various datastructures
within the event system are variations on the term "thread". The primary
datastructure that holds the state of an event loop in this system is called a
"threadmaster". Events scheduled on the event loop - what would today be called
an 'event' or 'task' in systems such as libevent - are called "threads" and the
datastructure for them is ``struct event``. To add to the confusion, these
"threads" have various types, one of which is "event". To hopefully avoid some
of this confusion, this document refers to these "threads" as 'tasks' except
where the datastructures are explicitly named. When they are explicitly named,
they will be formatted ``like this`` to differentiate from the conceptual
names. When speaking of kernel threads, the term used will be "pthread" since
FRR's kernel threading implementation uses the POSIX threads API.

.. This should be broken into its own document under :ref:`libfrr`
.. _event-architecture:

Event Architecture
------------------
This section presents a brief overview of the event model as currently
implemented in FRR. This doc should be expanded and broken off into its own
section. For now it provides basic information necessary to understand the
interplay between the event system and kernel threads.

The core event system is implemented in :file:`lib/thread.[ch]`. The primary
structure is ``struct event_loop``, hereafter referred to as a
``threadmaster``. A ``threadmaster`` is a global state object, or context, that
holds all the tasks currently pending execution as well as statistics on tasks
that have already executed. The event system is driven by adding tasks to this
data structure and then calling a function to retrieve the next task to
execute. At initialization, a daemon will typically create one
``threadmaster``, add a small set of initial tasks, and then run a loop to
fetch each task and execute it.

These tasks have various types corresponding to their general action. The types
are given by integer macros in :file:`event.h` and are:

``THREAD_READ``
   Task which waits for a file descriptor to become ready for reading and then
   executes.

``THREAD_WRITE``
   Task which waits for a file descriptor to become ready for writing and then
   executes.

``THREAD_TIMER``
   Task which executes after a certain amount of time has passed since it was
   scheduled.

``THREAD_EVENT``
   Generic task that executes with high priority and carries an arbitrary
   integer indicating the event type to its handler. These are commonly used to
   implement the finite state machines typically found in routing protocols.

``THREAD_READY``
   Type used internally for tasks on the ready queue.

``THREAD_UNUSED``
   Type used internally for ``struct event`` objects that aren't being used.
   The event system pools ``struct event`` to avoid heap allocations; this is
   the type they have when they're in the pool.

``THREAD_EXECUTE``
   Just before a task is run, its type is changed to this. This is used to show
   ``X`` as the type in the output of :clicmd:`show thread cpu`.

The programmer never has to work with these types explicitly. Each type of task
is created and queued via special-purpose functions (actually macros, but
irrelevant for the time being) for the specific type. For example, to add a
``THREAD_READ`` task, you would call

::

   event_add_read(struct event_loop *master, void (*handler)(struct event *), void *arg, int fd, struct event **ref);

The ``struct event`` is then created and added to the appropriate internal
datastructure within the ``threadmaster``. Note that the ``READ`` and
``WRITE`` tasks are independent - a ``READ`` task only tests for
readability, for example.
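
As a concrete illustration, the sketch below arms a read task on a socket and
re-arms it from inside the handler, since tasks fire only once. The
``my_peer`` structure, its fields and the callback names are invented for this
example, and the ``EVENT_ARG()`` accessor is an assumption of this sketch:

::

   /* Hypothetical owner of a connected socket and its pending read task. */
   struct my_peer {
           struct event_loop *loop;  /* threadmaster driving this peer */
           int fd;                   /* connected socket */
           struct event *t_read;     /* slot for the pending read task */
   };

   static void my_peer_read_cb(struct event *ev)
   {
           struct my_peer *peer = EVENT_ARG(ev);

           /* ... read from peer->fd and process the data ... */

           /* Tasks run once; explicitly schedule the next read. */
           event_add_read(peer->loop, my_peer_read_cb, peer, peer->fd,
                          &peer->t_read);
   }

   static void my_peer_start(struct my_peer *peer)
   {
           event_add_read(peer->loop, my_peer_read_cb, peer, peer->fd,
                          &peer->t_read);
   }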

The Event Loop
^^^^^^^^^^^^^^
To use the event system, after creating a ``threadmaster`` the program adds an
initial set of tasks. As these tasks execute, they add more tasks that execute
at some point in the future. This sequence of tasks drives the lifecycle of the
program. When no more tasks are available, the program dies. Typically the
first tasks added at startup are an I/O task for VTYSH as well as tasks for any
network sockets needed for peerings or IPC.

To retrieve the next task to run, the program calls ``event_fetch()``.
``event_fetch()`` internally computes which task to execute next based on
rudimentary priority logic. Events (type ``THREAD_EVENT``) execute with the
highest priority, followed by expired timers and finally I/O tasks (type
``THREAD_READ`` and ``THREAD_WRITE``). When scheduling a task, a function and
an arbitrary argument are provided. The task returned from ``event_fetch()`` is
then executed with ``event_call()``.
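
A minimal sketch of this fetch-execute cycle, loosely modeled on what
:file:`lib/libfrr.c` does in ``frr_run()``; daemon-specific setup and error
handling are omitted and only hinted at:

::

   struct event_loop *master;
   struct event task;

   /* ... create the threadmaster and schedule the initial tasks ... */

   /* event_fetch() blocks until a task is ready to run; event_call() then
    * runs the task's handler with its argument and records the statistics
    * shown by "show thread cpu". */
   while (event_fetch(master, &task))
           event_call(&task);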

The following diagram illustrates a simplified version of this infrastructure.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-single.png
   :align: center

   Lifecycle of a program using a single threadmaster.

The series of "task" boxes represents the current ready task queue. The various
other queues for other types are not shown. The fetch-execute loop is
illustrated at the bottom.

Mapping the general names used in the figure to specific FRR functions:

- ``task`` is ``struct event *``
- ``fetch`` is ``event_fetch()``
- ``exec()`` is ``event_call()``
- ``cancel()`` is ``event_cancel()``
- ``schedule()`` is any of the various task-specific ``event_add_*`` functions

Adding tasks is done with various task-specific function-like macros. These
macros wrap underlying functions in :file:`thread.c` to capture additional
information at compile time, such as the source line the task was scheduled
from, which can be accessed at runtime for debugging, logging and informational
purposes. Each task type has its own specific scheduling function that follows
the naming convention ``event_add_<type>``; see :file:`event.h` for details.
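
For instance, a timer task and a generic event task might be scheduled as in
the hedged sketch below; the callback, the task slots and the integer event
code are invented for illustration:

::

   /* Hypothetical examples of the other scheduling macros. */
   #define MY_FSM_TIMEOUT 1

   static struct event *t_holdtime, *t_retry, *t_fsm;

   static void my_generic_cb(struct event *ev)
   {
           /* ... handle the timer expiry or event ... */
   }

   static void arm_other_tasks(struct event_loop *master, void *arg)
   {
           /* timer specified in whole seconds */
           event_add_timer(master, my_generic_cb, arg, 90, &t_holdtime);

           /* timer specified in milliseconds */
           event_add_timer_msec(master, my_generic_cb, arg, 250, &t_retry);

           /* high-priority generic event carrying an arbitrary integer code */
           event_add_event(master, my_generic_cb, arg, MY_FSM_TIMEOUT, &t_fsm);
   }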

There are some gotchas to keep in mind:

- I/O tasks are keyed off the file descriptor associated with the I/O
  operation. This means that for any given file descriptor, only one of each
  type of I/O task (``THREAD_READ`` and ``THREAD_WRITE``) can be scheduled. For
  example, scheduling two write tasks one after the other will overwrite the
  first task with the second, resulting in total loss of the first task and
  difficult bugs; see the sketch after this list.

- Timer tasks are only as accurate as the monotonic clock provided by the
  underlying operating system.

- Memory management of the arbitrary handler argument passed in the schedule
  call is the responsibility of the caller.
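
To make the first gotcha concrete, the sketch below guards scheduling with the
``struct event **ref`` slot so a pending write task is never silently replaced,
and cancels through the same slot. The structure and helper names are invented,
and the sketch assumes the event system clears the reference when the task runs
or is cancelled:

::

   /* Hypothetical write-side companion to the earlier read sketch. */
   struct my_writer {
           struct event_loop *loop;
           int fd;
           struct event *t_write;   /* slot for the pending write task */
   };

   static void my_write_cb(struct event *ev)
   {
           /* ... flush queued data to the socket ... */
   }

   static void my_writes_on(struct my_writer *w)
   {
           /* Only one THREAD_WRITE task may exist per file descriptor, so
            * never schedule a second one on top of a pending one. */
           if (w->t_write)
                   return;

           event_add_write(w->loop, my_write_cb, w, w->fd, &w->t_write);
   }

   static void my_writes_off(struct my_writer *w)
   {
           /* Cancel through the same reference slot. */
           event_cancel(&w->t_write);
   }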


Kernel Thread Architecture
--------------------------
Efforts have begun to introduce kernel threads into FRR to improve performance
and stability. Naturally a kernel thread architecture has long been seen as
orthogonal to an event-driven architecture, and the two do have significant
overlap in terms of design choices. Since the event model is tightly integrated
into FRR, careful thought has been put into how pthreads are introduced, what
role they fill, and how they will interoperate with the event model.

Design Overview
^^^^^^^^^^^^^^^
Each kernel thread behaves as a lightweight process within FRR, sharing the
same process memory space. On the other hand, the event system is designed to
run in a single process and drive serial execution of a set of tasks. With this
consideration, a natural choice is to implement the event system within each
kernel thread. This allows us to leverage the event-driven execution model with
the currently existing task and context primitives. In this way the familiar
execution model of FRR gains the ability to execute tasks simultaneously while
preserving the existing model for concurrency.

The following figure illustrates the architecture with multiple pthreads, each
running their own ``threadmaster``-based event loop.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-multiple.png
   :align: center

   Lifecycle of a program using multiple pthreads, each running their own
   ``threadmaster``

Each roundrect represents a single pthread running the same event loop
described under :ref:`event-architecture`. Note the arrow from the ``exec()``
box on the right to the ``schedule()`` box in the middle pthread. This
illustrates code running in one pthread scheduling a task onto another
pthread's threadmaster. A global lock for each ``threadmaster`` is used to
synchronize these operations. The pthread names are examples.


.. This should be broken into its own document under :ref:`libfrr`
.. _kernel-thread-wrapper:

Kernel Thread Wrapper
^^^^^^^^^^^^^^^^^^^^^
The basis for the integration of pthreads and the event system is a lightweight
wrapper for both systems implemented in :file:`lib/frr_pthread.[ch]`. The
header provides a core datastructure, ``struct frr_pthread``, that encapsulates
structures from both POSIX threads and :file:`thread.[ch]`. In particular, this
datastructure has a pointer to a ``threadmaster`` that runs within the pthread.
It also has fields for a name as well as start and stop functions that have
signatures similar to the POSIX arguments for ``pthread_create()``.

Calling ``frr_pthread_new()`` creates and registers a new ``frr_pthread``. The
returned structure has a pre-initialized ``threadmaster``, and its ``start``
and ``stop`` functions are initialized to defaults that will run a basic event
loop with the given threadmaster. Calling ``frr_pthread_run()`` starts the
thread with the ``start`` function. From there, the model is the same as the
regular event model. To schedule tasks on a particular pthread, simply use the
regular :file:`thread.c` functions as usual and provide the ``threadmaster``
pointed to from the ``frr_pthread``. As part of implementing the wrapper, the
:file:`thread.c` functions were made thread-safe. Consequently, it is safe to
schedule events on a ``threadmaster`` belonging either to the calling pthread
or to *any other pthread*. This serves as the basis for inter-thread
communication and boils down to a slightly more complicated method of message
passing, where the messages are the regular task events as used in the
event-driven model. The only difference is task cancellation, which requires
calling ``event_cancel_async()`` instead of ``event_cancel()`` to cancel a task
currently scheduled on a ``threadmaster`` belonging to a different pthread.
This is necessary to avoid race conditions in the specific case where one
pthread wants to guarantee that a task on another pthread is cancelled before
proceeding.
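
A hedged sketch of this workflow follows, assuming the ``frr_pthread_new()``,
``frr_pthread_run()`` and ``frr_pthread_wait_running()`` calls and the
``master`` field behave as described above; the worker name and callback are
invented:

::

   #include "frr_pthread.h"

   static void worker_task_cb(struct event *ev)
   {
           /* runs inside the worker pthread's own event loop */
   }

   void start_worker(void)
   {
           struct frr_pthread *worker;

           /* NULL attributes select the default start/stop functions, which
            * run a standard event loop on the embedded threadmaster. */
           worker = frr_pthread_new(NULL, "My worker", "myworker");
           frr_pthread_run(worker, NULL);
           frr_pthread_wait_running(worker);

           /* Same scheduling API as before; only the threadmaster differs.
            * This is safe to call from any pthread. */
           event_add_event(worker->master, worker_task_cb, NULL, 0, NULL);
   }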

In addition, the existing commands to show statistics and other information for
tasks within the event driven model have been expanded to handle multiple
pthreads; running :clicmd:`show thread cpu` will display the usual event
breakdown, but it will do so for each pthread running in the program. For
example, :ref:`bgpd` runs a dedicated I/O pthread and shows the following
output for :clicmd:`show thread cpu`:

::

   frr# show thread cpu

   Thread statistics for bgpd:

   Showing statistics for pthread main
   ------------------------------------
   CPU (user+system): Real (wall-clock):
   Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
   0 1389.000 10 138900 248000 135549 255349 T subgroup_coalesce_timer
   0 0.000 1 0 0 18 18 T bgp_startup_timer_expire
   0 850.000 18 47222 222000 47795 233814 T work_queue_run
   0 0.000 10 0 0 6 14 T update_subgroup_merge_check_thread_cb
   0 0.000 8 0 0 117 160 W zclient_flush_data
   2 2.000 1 2000 2000 831 831 R bgp_accept
   0 1.000 1 1000 1000 2832 2832 E zclient_connect
   1 42082.000 240574 174 37000 178 72810 R vtysh_read
   1 152.000 1885 80 2000 96 6292 R zclient_read
   0 549346.000 2997298 183 7000 153 20242 E bgp_event
   0 2120.000 300 7066 14000 6813 22046 T (bgp_holdtime_timer)
   0 0.000 2 0 0 57 59 T update_group_refresh_default_originate_route_map
   0 90.000 1 90000 90000 73729 73729 T bgp_route_map_update_timer
   0 1417.000 9147 154 48000 132 61998 T bgp_process_packet
   300 71807.000 2995200 23 3000 24 11066 T (bgp_connect_timer)
   0 1894.000 12713 148 45000 112 33606 T (bgp_generate_updgrp_packets)
   0 0.000 1 0 0 105 105 W vtysh_write
   0 52.000 599 86 2000 138 6992 T (bgp_start_timer)
   1 1.000 8 125 1000 164 593 R vtysh_accept
   0 15.000 600 25 2000 15 153 T (bgp_routeadv_timer)
   0 11.000 299 36 3000 53 3128 RW bgp_connect_check


   Showing statistics for pthread BGP I/O thread
   ----------------------------------------------
   CPU (user+system): Real (wall-clock):
   Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
   0 1611.000 9296 173 13000 188 13685 R bgp_process_reads
   0 2995.000 11753 254 26000 182 29355 W bgp_process_writes


   Showing statistics for pthread BGP Keepalives thread
   -----------------------------------------------------
   CPU (user+system): Real (wall-clock):
   Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
   No data to display yet.

Attentive readers will notice that there is a third pthread, the Keepalives
thread. This pthread is responsible for -- surprise -- generating keepalives
for peers. However, no statistics are shown for that pthread. Although the
pthread uses the ``frr_pthread`` wrapper, it opts not to use the embedded
``threadmaster`` facilities. Instead it replaces the ``start`` and ``stop``
functions with custom functions. This was done because the ``threadmaster``
facilities introduce a small but significant amount of overhead relative to the
pthread's task. In this case, since the pthread does not need the event-driven
model and does not need to receive tasks from other pthreads, it is simpler and
more efficient to implement it outside of the provided event facilities. The
point to take away from this example is that while the facilities to make using
pthreads within FRR easy are already implemented, the wrapper is flexible and
allows usage of other models while still integrating with the rest of the FRR
core infrastructure. Starting and stopping this pthread works the same as it
does for any other ``frr_pthread``; the only difference is that event
statistics are not collected for it, because there are no events.
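
A hedged sketch of that alternative follows, assuming ``struct
frr_pthread_attr`` exposes ``start`` and ``stop`` hooks with signatures as
described above; the loop body, the flag and all names are invented, and a real
implementation such as the keepalives pthread will differ:

::

   #include <stdatomic.h>
   #include "frr_pthread.h"

   static atomic_bool my_pthread_run;

   /* Custom start function: no threadmaster, just a plain loop that would
    * sleep or wait between rounds of periodic work. */
   static void *my_pthread_start(void *arg)
   {
           struct frr_pthread *fpt = arg;

           atomic_store(&my_pthread_run, true);
           while (atomic_load(&my_pthread_run)) {
                   /* ... do the pthread's periodic work here ... */
           }

           return NULL;
   }

   /* Custom stop function: ask the loop above to exit. */
   static int my_pthread_stop(struct frr_pthread *fpt, void **result)
   {
           atomic_store(&my_pthread_run, false);
           return 0;
   }

   static const struct frr_pthread_attr my_attr = {
           .start = my_pthread_start,
           .stop = my_pthread_stop,
   };

   /* created and started like any other frr_pthread:
    * frr_pthread_new(&my_attr, "My custom pthread", "mycustom") */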

Notes on Design and Documentation
---------------------------------
Because of the choice to embed the existing event system into each pthread
within FRR, at this time there is no integrated support for other models of
pthread use such as divide and conquer. Similarly, there is no explicit support
for thread pooling or similar higher level constructs. The currently existing
infrastructure is designed around the concept of long-running worker threads
responsible for specific jobs within each daemon. This is not to say that
divide and conquer, thread pooling, etc. could not be implemented in the
future. However, designs in this direction must be very careful to take into
account the existing codebase. Introducing kernel threads into programs that
have been written under the assumption of a single thread of execution must be
done very carefully to avoid insidious errors and to ensure the program remains
understandable and maintainable.

In keeping with these goals, future work on kernel threading should be
extensively documented here, and FRR developers should be very careful with
their design choices, as poor choices that become tightly integrated can prove
catastrophic for future development efforts.