.. _process-architecture:

Process Architecture
====================

FRR is a suite of daemons that serve different functions. This document
describes the internal architecture of these daemons, focusing on their general
design patterns and especially on how threads are used in the daemons that use
them.

Overview
--------
The fundamental pattern used in FRR daemons is an `event loop
<https://en.wikipedia.org/wiki/Event_loop>`_. Some daemons use `kernel threads
<https://en.wikipedia.org/wiki/Thread_(computing)#Kernel_threads>`_. In these
daemons, each kernel thread runs its own event loop. The event loop
implementation is constructed to be thread safe and to allow threads other than
its owning thread to schedule events on it. The rest of this document describes
these two designs in detail.

Terminology
-----------
Because this document describes the architecture for kernel threads as well as
the event system, a digression on terminology is in order here.

Historically, Quagga's loop system was viewed as an implementation of userspace
threading. Because of this design choice, the names for various datastructures
within the event system are variations on the term "thread". The primary
datastructure that holds the state of an event loop in this system is called a
"threadmaster". Events scheduled on the event loop - what would today be called
an 'event' or 'task' in systems such as libevent - are called "threads" and the
datastructure for them is ``struct event``. To add to the confusion, these
"threads" have various types, one of which is "event". To hopefully avoid some
of this confusion, this document refers to these "threads" as 'tasks' except
where the datastructures are explicitly named. When they are explicitly named,
they will be formatted ``like this`` to differentiate from the conceptual
names. When speaking of kernel threads, the term used will be "pthread" since
FRR's kernel threading implementation uses the POSIX threads API.

.. This should be broken into its own document under :ref:`libfrr`
.. _event-architecture:

Event Architecture
------------------
This section presents a brief overview of the event model as currently
implemented in FRR. This doc should be expanded and broken off into its own
section. For now it provides basic information necessary to understand the
interplay between the event system and kernel threads.

The core event system is implemented in :file:`lib/thread.[ch]`. The primary
structure is ``struct event_loop``, hereafter referred to as a
``threadmaster``. A ``threadmaster`` is a global state object, or context, that
holds all the tasks currently pending execution as well as statistics on tasks
that have already executed. The event system is driven by adding tasks to this
data structure and then calling a function to retrieve the next task to
execute. At initialization, a daemon will typically create one
``threadmaster``, add a small set of initial tasks, and then run a loop to
fetch each task and execute it.

These tasks have various types corresponding to their general action. The types
are given by integer macros in :file:`event.h` and are:

``THREAD_READ``
   Task which waits for a file descriptor to become ready for reading and then
   executes.

``THREAD_WRITE``
   Task which waits for a file descriptor to become ready for writing and then
   executes.

``THREAD_TIMER``
   Task which executes after a certain amount of time has passed since it was
   scheduled.

``THREAD_EVENT``
   Generic task that executes with high priority and carries an arbitrary
   integer indicating the event type to its handler. These are commonly used to
   implement the finite state machines typically found in routing protocols.

``THREAD_READY``
   Type used internally for tasks on the ready queue.

``THREAD_UNUSED``
   Type used internally for ``struct event`` objects that aren't being used.
   The event system pools ``struct event`` to avoid heap allocations; this is
   the type they have when they're in the pool.

``THREAD_EXECUTE``
   Just before a task is run, its type is changed to this. This is used to show
   ``X`` as the type in the output of :clicmd:`show thread cpu`.

The programmer never has to work with these types explicitly. Each type of task
is created and queued via special-purpose functions (actually macros, but
irrelevant for the time being) for the specific type. For example, to add a
``THREAD_READ`` task, you would call

::

   event_add_read(struct event_loop *master, void (*handler)(struct event *), void *arg, int fd, struct event **ref);

The ``struct event`` is then created and added to the appropriate internal
datastructure within the ``threadmaster``. Note that the ``READ`` and
``WRITE`` tasks are independent - a ``READ`` task only tests for
readability, for example.
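
As a concrete illustration, the sketch below arms a read task on a socket and
re-arms it from inside the handler, since tasks fire only once. The
``my_peer`` structure, its fields and the callback names are invented for this
example, and the ``EVENT_ARG()`` accessor is an assumption of this sketch:

::

   /* Hypothetical owner of a connected socket and its pending read task. */
   struct my_peer {
           struct event_loop *loop;  /* threadmaster driving this peer */
           int fd;                   /* connected socket */
           struct event *t_read;     /* slot for the pending read task */
   };

   static void my_peer_read_cb(struct event *ev)
   {
           struct my_peer *peer = EVENT_ARG(ev);

           /* ... read from peer->fd and process the data ... */

           /* Tasks run once; explicitly schedule the next read. */
           event_add_read(peer->loop, my_peer_read_cb, peer, peer->fd,
                          &peer->t_read);
   }

   static void my_peer_start(struct my_peer *peer)
   {
           event_add_read(peer->loop, my_peer_read_cb, peer, peer->fd,
                          &peer->t_read);
   }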

The Event Loop
^^^^^^^^^^^^^^
To use the event system, after creating a ``threadmaster`` the program adds an
initial set of tasks. As these tasks execute, they add more tasks that execute
at some point in the future. This sequence of tasks drives the lifecycle of the
program. When no more tasks are available, the program dies. Typically the
first tasks added at startup are an I/O task for VTYSH as well as tasks for any
network sockets needed for peerings or IPC.

To retrieve the next task to run, the program calls ``event_fetch()``.
``event_fetch()`` internally computes which task to execute next based on
rudimentary priority logic. Events (type ``THREAD_EVENT``) execute with the
highest priority, followed by expired timers and finally I/O tasks (type
``THREAD_READ`` and ``THREAD_WRITE``). When scheduling a task, a function and
an arbitrary argument are provided. The task returned from ``event_fetch()`` is
then executed with ``event_call()``.
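
A minimal sketch of this fetch-execute cycle, loosely modeled on what
:file:`lib/libfrr.c` does in ``frr_run()``; daemon-specific setup and error
handling are omitted and only hinted at:

::

   struct event_loop *master;
   struct event task;

   /* ... create the threadmaster and schedule the initial tasks ... */

   /* event_fetch() blocks until a task is ready to run; event_call() then
    * runs the task's handler with its argument and records the statistics
    * shown by "show thread cpu". */
   while (event_fetch(master, &task))
           event_call(&task);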

The following diagram illustrates a simplified version of this infrastructure.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-single.png
   :align: center

   Lifecycle of a program using a single threadmaster.

The series of "task" boxes represents the current ready task queue. The various
other queues for other types are not shown. The fetch-execute loop is
illustrated at the bottom.

Mapping the general names used in the figure to specific FRR functions:

- ``task`` is ``struct event *``
- ``fetch`` is ``event_fetch()``
- ``exec()`` is ``event_call()``
- ``cancel()`` is ``event_cancel()``
- ``schedule()`` is any of the various task-specific ``event_add_*`` functions

Adding tasks is done with various task-specific function-like macros. These
macros wrap underlying functions in :file:`thread.c` to capture additional
information at compile time, such as the source line the task was scheduled
from, which can be accessed at runtime for debugging, logging and informational
purposes. Each task type has its own specific scheduling function that follows
the naming convention ``event_add_<type>``; see :file:`event.h` for details.
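
For instance, a timer task and a generic event task might be scheduled as in
the hedged sketch below; the callback, the task slots and the integer event
code are invented for illustration:

::

   /* Hypothetical examples of the other scheduling macros. */
   #define MY_FSM_TIMEOUT 1

   static struct event *t_holdtime, *t_retry, *t_fsm;

   static void my_generic_cb(struct event *ev)
   {
           /* ... handle the timer expiry or event ... */
   }

   static void arm_other_tasks(struct event_loop *master, void *arg)
   {
           /* timer specified in whole seconds */
           event_add_timer(master, my_generic_cb, arg, 90, &t_holdtime);

           /* timer specified in milliseconds */
           event_add_timer_msec(master, my_generic_cb, arg, 250, &t_retry);

           /* high-priority generic event carrying an arbitrary integer code */
           event_add_event(master, my_generic_cb, arg, MY_FSM_TIMEOUT, &t_fsm);
   }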

There are some gotchas to keep in mind:

- I/O tasks are keyed off the file descriptor associated with the I/O
  operation. This means that for any given file descriptor, only one of each
  type of I/O task (``THREAD_READ`` and ``THREAD_WRITE``) can be scheduled. For
  example, scheduling two write tasks one after the other will overwrite the
  first task with the second, resulting in total loss of the first task and
  difficult bugs; see the sketch after this list.

- Timer tasks are only as accurate as the monotonic clock provided by the
  underlying operating system.

- Memory management of the arbitrary handler argument passed in the schedule
  call is the responsibility of the caller.
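
To make the first gotcha concrete, the sketch below guards scheduling with the
``struct event **ref`` slot so a pending write task is never silently replaced,
and cancels through the same slot. The structure and helper names are invented,
and the sketch assumes the event system clears the reference when the task runs
or is cancelled:

::

   /* Hypothetical write-side companion to the earlier read sketch. */
   struct my_writer {
           struct event_loop *loop;
           int fd;
           struct event *t_write;   /* slot for the pending write task */
   };

   static void my_write_cb(struct event *ev)
   {
           /* ... flush queued data to the socket ... */
   }

   static void my_writes_on(struct my_writer *w)
   {
           /* Only one THREAD_WRITE task may exist per file descriptor, so
            * never schedule a second one on top of a pending one. */
           if (w->t_write)
                   return;

           event_add_write(w->loop, my_write_cb, w, w->fd, &w->t_write);
   }

   static void my_writes_off(struct my_writer *w)
   {
           /* Cancel through the same reference slot. */
           event_cancel(&w->t_write);
   }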


Kernel Thread Architecture
--------------------------
Efforts have begun to introduce kernel threads into FRR to improve performance
and stability. Naturally a kernel thread architecture has long been seen as
orthogonal to an event-driven architecture, and the two do have significant
overlap in terms of design choices. Since the event model is tightly integrated
into FRR, careful thought has been put into how pthreads are introduced, what
role they fill, and how they will interoperate with the event model.

Design Overview
^^^^^^^^^^^^^^^
Each kernel thread behaves as a lightweight process within FRR, sharing the
same process memory space. On the other hand, the event system is designed to
run in a single process and drive serial execution of a set of tasks. With this
consideration, a natural choice is to implement the event system within each
kernel thread. This allows us to leverage the event-driven execution model with
the currently existing task and context primitives. In this way the familiar
execution model of FRR gains the ability to execute tasks simultaneously while
preserving the existing model for concurrency.

The following figure illustrates the architecture with multiple pthreads, each
running their own ``threadmaster``-based event loop.

.. todo: replace these with SVG
.. figure:: ../figures/threadmaster-multiple.png
   :align: center

   Lifecycle of a program using multiple pthreads, each running their own
   ``threadmaster``

Each roundrect represents a single pthread running the same event loop
described under :ref:`event-architecture`. Note the arrow from the ``exec()``
box on the right to the ``schedule()`` box in the middle pthread. This
illustrates code running in one pthread scheduling a task onto another
pthread's threadmaster. A global lock for each ``threadmaster`` is used to
synchronize these operations. The pthread names are examples.


.. This should be broken into its own document under :ref:`libfrr`
.. _kernel-thread-wrapper:

Kernel Thread Wrapper
^^^^^^^^^^^^^^^^^^^^^
The basis for the integration of pthreads and the event system is a lightweight
wrapper for both systems implemented in :file:`lib/frr_pthread.[ch]`. The
header provides a core datastructure, ``struct frr_pthread``, that encapsulates
structures from both POSIX threads and :file:`thread.[ch]`. In particular, this
datastructure has a pointer to a ``threadmaster`` that runs within the pthread.
It also has fields for a name as well as start and stop functions that have
signatures similar to the POSIX arguments for ``pthread_create()``.

Calling ``frr_pthread_new()`` creates and registers a new ``frr_pthread``. The
returned structure has a pre-initialized ``threadmaster``, and its ``start``
and ``stop`` functions are initialized to defaults that will run a basic event
loop with the given threadmaster. Calling ``frr_pthread_run()`` starts the
thread with the ``start`` function. From there, the model is the same as the
regular event model. To schedule tasks on a particular pthread, simply use the
regular :file:`thread.c` functions as usual and provide the ``threadmaster``
pointed to from the ``frr_pthread``. As part of implementing the wrapper, the
:file:`thread.c` functions were made thread-safe. Consequently, it is safe to
schedule events on a ``threadmaster`` belonging either to the calling pthread
or to *any other pthread*. This serves as the basis for inter-thread
communication and boils down to a slightly more complicated method of message
passing, where the messages are the regular task events as used in the
event-driven model. The only difference is task cancellation, which requires
calling ``event_cancel_async()`` instead of ``event_cancel()`` to cancel a task
currently scheduled on a ``threadmaster`` belonging to a different pthread.
This is necessary to avoid race conditions in the specific case where one
pthread wants to guarantee that a task on another pthread is cancelled before
proceeding.
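
A hedged sketch of this workflow follows, assuming the ``frr_pthread_new()``,
``frr_pthread_run()`` and ``frr_pthread_wait_running()`` calls and the
``master`` field behave as described above; the worker name and callback are
invented:

::

   #include "frr_pthread.h"

   static void worker_task_cb(struct event *ev)
   {
           /* runs inside the worker pthread's own event loop */
   }

   void start_worker(void)
   {
           struct frr_pthread *worker;

           /* NULL attributes select the default start/stop functions, which
            * run a standard event loop on the embedded threadmaster. */
           worker = frr_pthread_new(NULL, "My worker", "myworker");
           frr_pthread_run(worker, NULL);
           frr_pthread_wait_running(worker);

           /* Same scheduling API as before; only the threadmaster differs.
            * This is safe to call from any pthread. */
           event_add_event(worker->master, worker_task_cb, NULL, 0, NULL);
   }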

In addition, the existing commands to show statistics and other information for
tasks within the event driven model have been expanded to handle multiple
pthreads; running :clicmd:`show thread cpu` will display the usual event
breakdown, but it will do so for each pthread running in the program. For
example, :ref:`bgpd` runs a dedicated I/O pthread and shows the following
output for :clicmd:`show thread cpu`:

::

   frr# show thread cpu

   Thread statistics for bgpd:

   Showing statistics for pthread main
   ------------------------------------
   CPU (user+system): Real (wall-clock):
   Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
   0 1389.000 10 138900 248000 135549 255349 T subgroup_coalesce_timer
   0 0.000 1 0 0 18 18 T bgp_startup_timer_expire
   0 850.000 18 47222 222000 47795 233814 T work_queue_run
   0 0.000 10 0 0 6 14 T update_subgroup_merge_check_thread_cb
   0 0.000 8 0 0 117 160 W zclient_flush_data
   2 2.000 1 2000 2000 831 831 R bgp_accept
   0 1.000 1 1000 1000 2832 2832 E zclient_connect
   1 42082.000 240574 174 37000 178 72810 R vtysh_read
   1 152.000 1885 80 2000 96 6292 R zclient_read
   0 549346.000 2997298 183 7000 153 20242 E bgp_event
   0 2120.000 300 7066 14000 6813 22046 T (bgp_holdtime_timer)
   0 0.000 2 0 0 57 59 T update_group_refresh_default_originate_route_map
   0 90.000 1 90000 90000 73729 73729 T bgp_route_map_update_timer
   0 1417.000 9147 154 48000 132 61998 T bgp_process_packet
   300 71807.000 2995200 23 3000 24 11066 T (bgp_connect_timer)
   0 1894.000 12713 148 45000 112 33606 T (bgp_generate_updgrp_packets)
   0 0.000 1 0 0 105 105 W vtysh_write
   0 52.000 599 86 2000 138 6992 T (bgp_start_timer)
   1 1.000 8 125 1000 164 593 R vtysh_accept
   0 15.000 600 25 2000 15 153 T (bgp_routeadv_timer)
   0 11.000 299 36 3000 53 3128 RW bgp_connect_check


   Showing statistics for pthread BGP I/O thread
   ----------------------------------------------
   CPU (user+system): Real (wall-clock):
   Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
   0 1611.000 9296 173 13000 188 13685 R bgp_process_reads
   0 2995.000 11753 254 26000 182 29355 W bgp_process_writes


   Showing statistics for pthread BGP Keepalives thread
   -----------------------------------------------------
   CPU (user+system): Real (wall-clock):
   Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
   No data to display yet.

Attentive readers will notice that there is a third pthread, the Keepalives
thread. This pthread is responsible for -- surprise -- generating keepalives
for peers. However, no statistics are shown for that pthread. Although the
pthread uses the ``frr_pthread`` wrapper, it opts not to use the embedded
``threadmaster`` facilities. Instead it replaces the ``start`` and ``stop``
functions with custom functions. This was done because the ``threadmaster``
facilities introduce a small but significant amount of overhead relative to the
pthread's task. In this case, since the pthread does not need the event-driven
model and does not need to receive tasks from other pthreads, it is simpler and
more efficient to implement it outside of the provided event facilities. The
point to take away from this example is that while the facilities to make using
pthreads within FRR easy are already implemented, the wrapper is flexible and
allows usage of other models while still integrating with the rest of the FRR
core infrastructure. Starting and stopping this pthread works the same as it
does for any other ``frr_pthread``; the only difference is that event
statistics are not collected for it, because there are no events.
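
A hedged sketch of that alternative follows, assuming ``struct
frr_pthread_attr`` exposes ``start`` and ``stop`` hooks with signatures as
described above; the loop body, the flag and all names are invented, and a real
implementation such as the keepalives pthread will differ:

::

   #include <stdatomic.h>
   #include "frr_pthread.h"

   static atomic_bool my_pthread_run;

   /* Custom start function: no threadmaster, just a plain loop that would
    * sleep or wait between rounds of periodic work. */
   static void *my_pthread_start(void *arg)
   {
           struct frr_pthread *fpt = arg;

           atomic_store(&my_pthread_run, true);
           while (atomic_load(&my_pthread_run)) {
                   /* ... do the pthread's periodic work here ... */
           }

           return NULL;
   }

   /* Custom stop function: ask the loop above to exit. */
   static int my_pthread_stop(struct frr_pthread *fpt, void **result)
   {
           atomic_store(&my_pthread_run, false);
           return 0;
   }

   static const struct frr_pthread_attr my_attr = {
           .start = my_pthread_start,
           .stop = my_pthread_stop,
   };

   /* created and started like any other frr_pthread:
    * frr_pthread_new(&my_attr, "My custom pthread", "mycustom") */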

Notes on Design and Documentation
---------------------------------
Because of the choice to embed the existing event system into each pthread
within FRR, at this time there is no integrated support for other models of
pthread use such as divide and conquer. Similarly, there is no explicit support
for thread pooling or similar higher level constructs. The currently existing
infrastructure is designed around the concept of long-running worker threads
responsible for specific jobs within each daemon. This is not to say that
divide and conquer, thread pooling, etc. could not be implemented in the
future. However, designs in this direction must be very careful to take into
account the existing codebase. Introducing kernel threads into programs that
have been written under the assumption of a single thread of execution must be
done very carefully to avoid insidious errors and to ensure the program remains
understandable and maintainable.

In keeping with these goals, future work on kernel threading should be
extensively documented here, and FRR developers should be very careful with
their design choices, as poor choices that become tightly integrated can prove
catastrophic for future development efforts.