.. _process-architecture:

Process Architecture
====================

FRR inherited its overall design architecture from Quagga. The chosen model for
Quagga is that of a suite of independent daemons that do IPC via Unix domain
sockets. Within each daemon, the architecture follows the event-driven model.
FRR has inherited this model as well. As FRR is deployed at larger scales and
gains ever more features, each adding to the overall processing workload, we
are approaching the saturation point for a single thread per daemon. In light
of this, there are ongoing efforts to introduce multithreading to various
components of FRR. This document aims to describe the current design choices
and overall model for integrating the event-driven and multithreaded
architectures into a cohesive whole.

Terminology
-----------
Because this document describes the architecture for true kernel threads as
well as the event system, a digression on terminology is in order here.

Historically Quagga's event system was viewed as an implementation of userspace
threading. Because of this design choice, the names for various datastructures
within the event system are variations on the term "thread". The primary
context datastructure in this system is called a "threadmaster". What would
today be called an 'event' or 'task' in systems such as libevent are called
"threads" and the datastructure for them is ``struct thread``. To add to the
confusion, these "threads" have various types, one of which is "event". To
hopefully avoid some of this confusion, this document refers to these "threads"
as 'tasks' except where the datastructures are explicitly named. When they are
explicitly named, they will be formatted ``like this`` to differentiate from
the conceptual names. When speaking of kernel threads, the term used will be
"pthread" since FRR's kernel threading implementation is POSIX threads.

.. This should be broken into its own document under :ref:`libfrr`
.. _event-architecture:

Event Architecture
------------------
This section presents a brief overview of the event model as currently
implemented in FRR. This doc should be expanded and broken off into its own
section. For now it provides basic information necessary to understand the
interplay between the event system and kernel threads.

The core event system is implemented in :file:`lib/thread.[ch]`. The primary
structure is ``struct thread_master``, hereafter referred to as a
``threadmaster``. A ``threadmaster`` is a global state object, or context, that
holds all the tasks currently pending execution as well as statistics on tasks
that have already executed. The event system is driven by adding tasks to this
data structure and then calling a function to retrieve the next task to
execute. At initialization, a daemon will typically create one
``threadmaster``, add a small set of initial tasks, and then run a loop to
fetch each task and execute it.

These tasks have various types corresponding to their general action. The types
are given by integer macros in :file:`thread.h` and are:

``THREAD_READ``
   Task which waits for a file descriptor to become ready for reading and then
   executes.

``THREAD_WRITE``
   Task which waits for a file descriptor to become ready for writing and then
   executes.

``THREAD_TIMER``
   Task which executes after a certain amount of time has passed since it was
   scheduled.

``THREAD_EVENT``
   Generic task that executes with high priority and carries an arbitrary
   integer indicating the event type to its handler. These are commonly used to
   implement the finite state machines typically found in routing protocols.

``THREAD_READY``
   Type used internally for tasks on the ready queue.

``THREAD_UNUSED``
   Type used internally for ``struct thread`` objects that aren't being used.
   The event system pools ``struct thread`` to avoid heap allocations; this is
   the type they have when they're in the pool.

``THREAD_EXECUTE``
   Just before a task is run, its type is changed to this. This is used to show
   ``X`` as the type in the output of :clicmd:`show thread cpu`.

The programmer never has to work with these types explicitly. Each type of task
is created and queued via special-purpose functions (actually macros, but
irrelevant for the time being) for the specific type. For example, to add a
``THREAD_READ`` task, you would call

::

   thread_add_read(struct thread_master *master,
                   int (*handler)(struct thread *), void *arg, int fd,
                   struct thread **ref);

The ``struct thread`` is then created and added to the appropriate internal
datastructure within the ``threadmaster``.

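As a concrete sketch, here is roughly how a daemon might schedule a read task
on a socket. The names ``master``, ``sockfd``, ``my_ctx`` and ``handle_read``
are hypothetical; the ``THREAD_ARG`` and ``THREAD_FD`` accessor macros are
provided by :file:`thread.h`.

::

   /* Hypothetical read handler. Handlers receive the task that fired and
    * can retrieve their argument and file descriptor from it. */
   static int handle_read(struct thread *thread)
   {
           void *ctx = THREAD_ARG(thread);
           int fd = THREAD_FD(thread);

           /* ... read from fd and process the data ... */

           /* Tasks fire exactly once; to keep reading, the handler must
            * reschedule itself. */
           thread_add_read(master, handle_read, ctx, fd, NULL);
           return 0;
   }

   /* Initial scheduling, e.g. from daemon startup code */
   thread_add_read(master, handle_read, my_ctx, sockfd, NULL);
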
The Event Loop
^^^^^^^^^^^^^^
To use the event system, after creating a ``threadmaster`` the program adds an
initial set of tasks. As these tasks execute, they add more tasks that execute
at some point in the future. This sequence of tasks drives the lifecycle of the
program. When no more tasks are available, the program dies. Typically at
startup the first task added is an I/O task for VTYSH as well as any network
sockets needed for peerings or IPC.

To retrieve the next task to run, the program calls ``thread_fetch()``.
``thread_fetch()`` internally computes which task to execute next based on
rudimentary priority logic. Events (type ``THREAD_EVENT``) execute with the
highest priority, followed by expired timers and finally I/O tasks (types
``THREAD_READ`` and ``THREAD_WRITE``). When scheduling a task, a function and
an arbitrary argument are provided. The task returned from ``thread_fetch()``
is then executed with ``thread_call()``.

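Putting these pieces together, the skeleton of a daemon's lifecycle looks
roughly like the following sketch. (``thread_master_create()`` is the
constructor in :file:`thread.c`; whether it takes a name argument depends on
the FRR version, so treat the exact call as illustrative.)

::

   struct thread_master *master = thread_master_create("example");
   struct thread thread;

   /* add initial tasks here, e.g. thread_add_read() on listening sockets */

   /* fetch-execute loop; thread_fetch() blocks until a task is ready */
   while (thread_fetch(master, &thread))
           thread_call(&thread);
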
The following diagram illustrates a simplified version of this infrastructure.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-single.png
   :align: center

   Lifecycle of a program using a single threadmaster.

The series of "task" boxes represents the current ready task queue. The various
other queues for other types are not shown. The fetch-execute loop is
illustrated at the bottom.

Mapping the general names used in the figure to specific FRR functions:

- ``task`` is ``struct thread *``
- ``fetch`` is ``thread_fetch()``
- ``exec()`` is ``thread_call()``
- ``cancel()`` is ``thread_cancel()``
- ``schedule()`` is any of the various task-specific ``thread_add_*`` functions

Adding tasks is done with various task-specific function-like macros. These
macros wrap underlying functions in :file:`thread.c` to provide additional
information added at compile time, such as the line number the task was
scheduled from, that can be accessed at runtime for debugging, logging and
informational purposes. Each task type has its own specific scheduling function
that follows the naming convention ``thread_add_<type>``; see :file:`thread.h`
for details.

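For example, each of the following calls schedules a task of the corresponding
type. The handler names, context argument and file descriptor are
hypothetical; the argument order follows the pattern in :file:`thread.h`.

::

   /* run hello_timer() after 5 seconds */
   thread_add_timer(master, hello_timer, ctx, 5, NULL);

   /* run handle_event() at high priority, passing event code 0 */
   thread_add_event(master, handle_event, ctx, 0, NULL);

   /* run flush_buffer() when sockfd becomes writeable */
   thread_add_write(master, flush_buffer, ctx, sockfd, NULL);
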
There are some gotchas to keep in mind:

- I/O tasks are keyed off the file descriptor associated with the I/O
  operation. This means that for any given file descriptor, only one of each
  type of I/O task (``THREAD_READ`` and ``THREAD_WRITE``) can be scheduled. For
  example, scheduling two write tasks one after the other will overwrite the
  first task with the second, resulting in total loss of the first task and
  difficult bugs; see the sketch after this list.

- Timer tasks are only as accurate as the monotonic clock provided by the
  underlying operating system.

- Memory management of the arbitrary handler argument passed in the schedule
  call is the responsibility of the caller.

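A minimal sketch of the first gotcha, with hypothetical handler names:

::

   /* Both write tasks are keyed on sockfd. The second call overwrites the
    * first, so flush_a() is silently lost and will never run. */
   thread_add_write(master, flush_a, ctx, sockfd, NULL);
   thread_add_write(master, flush_b, ctx, sockfd, NULL);
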
Kernel Thread Architecture
--------------------------
Efforts have begun to introduce kernel threads into FRR to improve performance
and stability. Naturally a kernel thread architecture has long been seen as
orthogonal to an event-driven architecture, but the two do have significant
overlap in terms of design choices. Since the event model is tightly integrated
into FRR, careful thought has been put into how pthreads are introduced, what
role they fill, and how they will interoperate with the event model.

Design Overview
^^^^^^^^^^^^^^^
Each kernel thread behaves as a lightweight process within FRR, sharing the
same process memory space. On the other hand, the event system is designed to
run within a single thread of execution and drive serial execution of a set of
tasks. With this consideration, a natural choice is to implement the event
system within each kernel thread. This allows us to leverage the event-driven
execution model with the currently existing task and context primitives. In
this way the familiar execution model of FRR gains the ability to execute
tasks simultaneously while preserving the existing model for concurrency.

The following figure illustrates the architecture with multiple pthreads, each
running its own ``threadmaster``-based event loop.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-multiple.png
   :align: center

   Lifecycle of a program using multiple pthreads, each running its own
   ``threadmaster``

Each roundrect represents a single pthread running the same event loop
described under :ref:`event-architecture`. Note the arrow from the ``exec()``
box on the right to the ``schedule()`` box in the middle pthread. This
illustrates code running in one pthread scheduling a task onto another
pthread's threadmaster. Each ``threadmaster`` has its own lock, which is used
to synchronize these cross-thread operations. The pthread names are examples.

.. This should be broken into its own document under :ref:`libfrr`
.. _kernel-thread-wrapper:

Kernel Thread Wrapper
^^^^^^^^^^^^^^^^^^^^^
The basis for the integration of pthreads and the event system is a lightweight
wrapper for both systems implemented in :file:`lib/frr_pthread.[ch]`. The
header provides a core datastructure, ``struct frr_pthread``, that encapsulates
structures from both POSIX threads and :file:`thread.[ch]`. In particular, this
datastructure has a pointer to a ``threadmaster`` that runs within the pthread.
It also has fields for a name as well as start and stop functions that have
signatures similar to the POSIX arguments for ``pthread_create()``.

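A simplified sketch of the wrapper's shape, with illustrative field names (see
:file:`lib/frr_pthread.h` for the real definition, which differs in detail):

::

   /* Illustrative only; not the actual FRR definition. */
   struct frr_pthread {
           pthread_t thread;             /* underlying POSIX thread */
           const char *name;
           struct thread_master *master; /* event loop run inside the pthread */

           /* start/stop functions, pthread_create()-style */
           void *(*start)(void *arg);
           int (*stop)(struct frr_pthread *fpt, void **result);
   };
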
Calling ``frr_pthread_new()`` creates and registers a new ``frr_pthread``. The
returned structure has a pre-initialized ``threadmaster``, and its ``start``
and ``stop`` functions are initialized to defaults that will run a basic event
loop with the given threadmaster. Calling ``frr_pthread_run()`` starts the
thread with the ``start`` function. From there, the model is the same as the
regular event model. To schedule tasks on a particular pthread, simply use the
regular :file:`thread.c` functions as usual and provide the ``threadmaster``
pointed to from the ``frr_pthread``. As part of implementing the wrapper, the
:file:`thread.c` functions were made thread-safe. Consequently, it is safe to
schedule events on a ``threadmaster`` belonging either to the calling pthread
or to *any other pthread*. This serves as the basis for inter-thread
communication and boils down to a slightly more complicated method of message
passing, where the messages are the regular task events as used in the
event-driven model. The only difference is task cancellation, which requires
calling ``thread_cancel_async()`` instead of ``thread_cancel()`` to cancel a
task currently scheduled on a ``threadmaster`` belonging to a different
pthread. This is necessary to avoid race conditions in the specific case where
one pthread wants to guarantee that a task on another pthread is cancelled
before proceeding.

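A rough usage sketch follows. The exact signatures of ``frr_pthread_new()``
and ``frr_pthread_run()`` vary between FRR versions, so treat the argument
lists as illustrative; the ``master`` field name follows the sketch above, and
the handler and context names are hypothetical.

::

   /* Create the wrapper; it comes with its own threadmaster and default
    * start/stop functions that run a basic event loop. */
   struct frr_pthread *fpt = frr_pthread_new(NULL, "example", "example");

   /* Start the pthread; the default start function runs the event loop. */
   frr_pthread_run(fpt, NULL);

   /* Inter-thread communication: schedule a task onto the new pthread's
    * threadmaster from any other pthread. */
   struct thread *ref = NULL;
   thread_add_event(fpt->master, handle_event, ctx, 0, &ref);

   /* Cancelling a task on another pthread's threadmaster must use the
    * asynchronous variant. */
   thread_cancel_async(fpt->master, &ref, NULL);
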
In addition, the existing commands to show statistics and other information for
tasks within the event-driven model have been expanded to handle multiple
pthreads; running :clicmd:`show thread cpu` will display the usual event
breakdown, but it will do so for each pthread running in the program. For
example, :ref:`bgpd` runs a dedicated I/O pthread and shows the following
output for :clicmd:`show thread cpu`:

::

   frr# show thread cpu

   Thread statistics for bgpd:

   Showing statistics for pthread main
   ------------------------------------
                        CPU (user+system):  Real (wall-clock):
   Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  Type  Thread
       0      1389.000        10   138900   248000   135549   255349    T     subgroup_coalesce_timer
       0         0.000         1        0        0       18       18    T     bgp_startup_timer_expire
       0       850.000        18    47222   222000    47795   233814    T     work_queue_run
       0         0.000        10        0        0        6       14    T     update_subgroup_merge_check_thread_cb
       0         0.000         8        0        0      117      160    W     zclient_flush_data
       2         2.000         1     2000     2000      831      831    R     bgp_accept
       0         1.000         1     1000     1000     2832     2832    E     zclient_connect
       1     42082.000    240574      174    37000      178    72810    R     vtysh_read
       1       152.000      1885       80     2000       96     6292    R     zclient_read
       0    549346.000   2997298      183     7000      153    20242    E     bgp_event
       0      2120.000       300     7066    14000     6813    22046    T     (bgp_holdtime_timer)
       0         0.000         2        0        0       57       59    T     update_group_refresh_default_originate_route_map
       0        90.000         1    90000    90000    73729    73729    T     bgp_route_map_update_timer
       0      1417.000      9147      154    48000      132    61998    T     bgp_process_packet
     300     71807.000   2995200       23     3000       24    11066    T     (bgp_connect_timer)
       0      1894.000     12713      148    45000      112    33606    T     (bgp_generate_updgrp_packets)
       0         0.000         1        0        0      105      105    W     vtysh_write
       0        52.000       599       86     2000      138     6992    T     (bgp_start_timer)
       1         1.000         8      125     1000      164      593    R     vtysh_accept
       0        15.000       600       25     2000       15      153    T     (bgp_routeadv_timer)
       0        11.000       299       36     3000       53     3128    RW    bgp_connect_check


   Showing statistics for pthread BGP I/O thread
   ----------------------------------------------
                        CPU (user+system):  Real (wall-clock):
   Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  Type  Thread
       0      1611.000      9296      173    13000      188    13685    R     bgp_process_reads
       0      2995.000     11753      254    26000      182    29355    W     bgp_process_writes


   Showing statistics for pthread BGP Keepalives thread
   -----------------------------------------------------
                        CPU (user+system):  Real (wall-clock):
   Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  Type  Thread
   No data to display yet.

Attentive readers will notice that there is a third pthread, the Keepalives
thread. This pthread is responsible for -- surprise -- generating keepalives
for peers. However, no statistics are shown for it. Although the pthread uses
the ``frr_pthread`` wrapper, it opts not to use the embedded ``threadmaster``
facilities. Instead it replaces the ``start`` and ``stop`` functions with
custom functions. This was done because the ``threadmaster`` facilities
introduce a small but significant amount of overhead relative to the pthread's
task. In this case, since the pthread does not need the event-driven model and
does not need to receive tasks from other pthreads, it is simpler and more
efficient to implement it outside of the provided event facilities. The point
to take away from this example is that while the facilities to make using
pthreads within FRR easy are already implemented, the wrapper is flexible and
allows usage of other models while still integrating with the rest of the FRR
core infrastructure. Starting and stopping this pthread works the same as it
does for any other ``frr_pthread``; the only difference is that event
statistics are not collected for it, because there are no events.

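Swapping in custom behavior is a matter of providing your own ``start`` and
``stop`` functions. The sketch below assumes they are supplied through an
attribute structure passed to ``frr_pthread_new()``; the exact mechanism and
all names here are illustrative, see :file:`lib/frr_pthread.[ch]` for the real
interface.

::

   /* Custom entry point with a pthread_create()-style shape; runs its own
    * loop instead of a threadmaster-based event loop. */
   static void *keepalives_start(void *arg)
   {
           struct frr_pthread *fpt = arg;

           while (keepalives_should_run(fpt)) /* hypothetical predicate */
                   send_keepalives();         /* hypothetical worker */

           return NULL;
   }

   /* Matching custom stop function (hypothetical) */
   static int keepalives_stop(struct frr_pthread *fpt, void **result);

   struct frr_pthread_attr attr = {
           .start = keepalives_start,
           .stop = keepalives_stop,
   };
   struct frr_pthread *fpt =
           frr_pthread_new(&attr, "keepalives", "keepalives");
   frr_pthread_run(fpt, NULL);
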
Notes on Design and Documentation
---------------------------------
Because of the choice to embed the existing event system into each pthread
within FRR, at this time there is no integrated support for other models of
pthread use, such as divide and conquer. Similarly, there is no explicit
support for thread pooling or similar higher level constructs. The currently
existing infrastructure is designed around the concept of long-running worker
threads responsible for specific jobs within each daemon. This is not to say
that divide and conquer, thread pooling, etc. could not be implemented in the
future. However, designs in this direction must be very careful to take into
account the existing codebase. Introducing kernel threads into programs that
have been written under the assumption of a single thread of execution must be
done very carefully to avoid insidious errors and to ensure the program remains
understandable and maintainable.

In keeping with these goals, future work on kernel threading should be
extensively documented here, and FRR developers should be very careful with
their design choices, as poor choices that become tightly integrated can prove
catastrophic for future development efforts.