doc/developer/logging.rst

   1 .. _logging:
   2
   3 Logging
   4 =======
   5
   6 One of the most frequent decisions to make while writing code for FRR is what
   7 to log, what level to log it at, and when to log it.  Here is a list of
   8 recommendations for these decisions.
   9
  10
  11 printfrr()
  12 ----------
  13
  14 ``printfrr()`` is FRR's modified version of ``printf()``, designed to make
  15 life easier when printing nontrivial datastructures.  The following variants
  16 are available:
  17
  18 .. c:function:: ssize_t snprintfrr(char *buf, size_t len, const char *fmt, ...)
  19 .. c:function:: ssize_t vsnprintfrr(char *buf, size_t len, const char *fmt, va_list)
  20
  21    These correspond to ``snprintf``/``vsnprintf``.  If you pass NULL for buf
  22    or 0 for len, no output is written but the return value is still calculated.
  23
  24    The return value is always the full length of the output, unconstrained by
  25    `len`.  It does **not** include the terminating ``\0`` character.  A
  26    malformed format string can result in a ``-1`` return value.
  27
  28 .. c:function:: ssize_t csnprintfrr(char *buf, size_t len, const char *fmt, ...)
  29 .. c:function:: ssize_t vcsnprintfrr(char *buf, size_t len, const char *fmt, va_list)
  30
  31    Same as above, but the ``c`` stands for "continue" or "concatenate".  The
  32    output is appended to the string instead of overwriting it.
  33
  34 .. c:function:: char *asprintfrr(struct memtype *mt, const char *fmt, ...)
  35 .. c:function:: char *vasprintfrr(struct memtype *mt, const char *fmt, va_list)
  36
  37    These functions allocate a dynamic buffer (using MTYPE `mt`) and print to
  38    that.  If the format string is malformed, they return a copy of the format
  39    string, so the return value is always non-NULL and always dynamically
  40    allocated with `mt`.
  41
  42 .. c:function:: char *asnprintfrr(struct memtype *mt, char *buf, size_t len, const char *fmt, ...)
  43 .. c:function:: char *vasnprintfrr(struct memtype *mt, char *buf, size_t len, const char *fmt, va_list)
  44
  45    This variant tries to use the static buffer provided, but falls back to
  46    dynamic allocation if it is insufficient.
  47
  48    The return value can be either `buf` or a newly allocated string using
  49    `mt`.  You MUST free it like this::
  50
  51       char *ret = asnprintfrr(MTYPE_FOO, buf, sizeof(buf), ...);
  52       if (ret != buf)
  53          XFREE(MTYPE_FOO, ret);
  54
  55 Extensions
  56 ^^^^^^^^^^
  57
  58 ``printfrr()`` format strings can be extended with suffixes after `%p` or
  59 `%d`.  The following extended format specifiers are available:
  60
  61 +-----------+--------------------------+----------------------------------------------+
  62 | Specifier | Argument                 | Output                                       |
  63 +===========+==========================+==============================================+
  64 | ``%Lu``   | ``uint64_t``             | ``12345``                                    |
  65 +-----------+--------------------------+----------------------------------------------+
  66 | ``%Ld``   | ``int64_t``              | ``-12345``                                   |
  67 +-----------+--------------------------+----------------------------------------------+
  68 | ``%pVA``  | ``struct va_format *``   | (recursive printfrr)                         |
  69 +-----------+--------------------------+----------------------------------------------+
  70 | ``%pI4``  | ``struct in_addr *``     | ``1.2.3.4``                                  |
  71 |           |                          |                                              |
  72 |           | ``in_addr_t *``          |                                              |
  73 +-----------+--------------------------+----------------------------------------------+
  74 | ``%pI6``  | ``struct in6_addr *``    | ``fe80::1234``                               |
  75 +-----------+--------------------------+----------------------------------------------+
  76 | ``%pIA``  | ``struct ipaddr *``      | ``1.2.3.4``                                  |
  77 |           |                          |                                              |
  78 |           |                          | ``fe80::1234``                               |
  79 +-----------+--------------------------+----------------------------------------------+
  80 | ``%pFX``  | ``struct prefix *``      | ``fe80::1234/64``                            |
  81 +-----------+--------------------------+----------------------------------------------+
  82 | ``%pSG4`` | ``struct prefix_sg *``   | ``(*,1.2.3.4)``                              |
  83 +-----------+--------------------------+----------------------------------------------+
  84 | ``%pRN``  | ``struct route_node *``  | ``192.168.1.0/24`` (dst-only node)           |
  85 |           |                          |                                              |
  86 |           |                          | ``2001:db8::/32 from fe80::/64`` (SADR node) |
  87 +-----------+--------------------------+----------------------------------------------+
  88 | ``%pNHv`` | ``struct nexthop *``     | ``1.2.3.4, via eth0``                        |
  89 +-----------+--------------------------+----------------------------------------------+
  90 | ``%pNHs`` | ``struct nexthop *``     | ``1.2.3.4 if 15``                            |
  91 +-----------+--------------------------+----------------------------------------------+
  92 | ``%pFX``  | ``struct bgp_dest *``    | ``fe80::1234/64`` (available in BGP only)    |
  93 +-----------+--------------------------+----------------------------------------------+
  94 | ``%.*pHX``| ``int len, void *ptr``   | ``12 34 56 78`` (hexdump)                    |
  95 +-----------+--------------------------+----------------------------------------------+
  96
  97 Printf features like field lengths can be used normally with these extensions,
  98 e.g. ``%-15pI4`` works correctly.
  99
 100 The extension specifier after ``%p`` or ``%d`` is always an uppercase letter;
 101 by means of established pattern uppercase letters and numbers form the type
 102 identifier which may be followed by lowercase flags.
 103
 104 You can grep the FRR source for ``printfrr_ext_autoreg`` to see all extended
 105 printers and what exactly they do.  More printers are likely to be added as
 106 needed/useful, so the list above may become outdated.
 107
 108 ``%Ld`` is not an "extension" for printfrr; it's wired directly into the main
 109 printf logic.
 110
 111 .. note::
 112
 113    The ``zlog_*``/``flog_*`` and ``vty_out`` functions all use printfrr
 114    internally, so these extensions are available there.  However, they are
 115    **not** available when calling ``snprintf`` directly.  You need to call
 116    ``snprintfrr`` instead.
 117
 118 AS-Safety
 119 ^^^^^^^^^
 120
 121 ``printfrr()`` are AS-Safe under the following conditions:
 122
 123 * the ``[v]as[n]printfrr`` variants are not AS-Safe (allocating memory)
 124 * floating point specifiers are not AS-Safe (system printf is used for these)
 125 * the positional ``%1$d`` syntax should not be used (8 arguments are supported
 126   while AS-Safe)
 127 * extensions are only AS-Safe if their printer is AS-Safe
 128
 129 Log levels
 130 ----------
 131
 132 Errors and warnings
 133 ^^^^^^^^^^^^^^^^^^^
 134
 135 If it is something that the user will want to look at and maybe do
 136 something, it is either an **error** or a **warning**.
 137
 138 We're expecting that warnings and errors are in some way visible to the
 139 user (in the worst case by looking at the log after the network broke, but
 140 maybe by a syslog collector from all routers.)  Therefore, anything that
 141 needs to get the user in the loop—and only these things—are warnings or
 142 errors.
 143
 144 Note that this doesn't necessarily mean the user needs to fix something in
 145 the FRR instance.  It also includes when we detect something else needs
 146 fixing, for example another router, the system we're running on, or the
 147 configuration.  The common point is that the user should probably do
 148 *something*.
 149
 150 Deciding between a warning and an error is slightly less obvious; the rule
 151 of thumb here is that an error will cause considerable fallout beyond its
 152 direct effect.  Closing a BGP session due to a malformed update is an error
 153 since all routes from the peer are dropped; discarding one route because
 154 its attributes don't make sense is a warning.
 155
 156 This also loosely corresponds to the kind of reaction we're expecting from
 157 the user.  An error is likely to need immediate response while a warning
 158 might be snoozed for a bit and addressed as part of general maintenance.
 159 If a problem will self-repair (e.g. by retransmits), it should be a
 160 warning—unless the impact until that self-repair is very harsh.
 161
 162 Examples for warnings:
 163
 164 * a BGP update, LSA or LSP could not be processed, but operation is
 165   proceeding and the broken pieces are likely to self-fix later
 166 * some kind of controller cannot be reached, but we can work without it
 167 * another router is using some unknown or unsupported capability
 168
 169 Examples for errors:
 170
 171 * dropping a BGP session due to malformed data
 172 * a socket for routing protocol operation cannot be opened
 173 * desynchronization from network state because something went wrong
 174 * *everything that we as developers would really like to be notified about,
 175   i.e. some assumption in the code isn't holding up*
 176
 177
 178 Informational messages
 179 ^^^^^^^^^^^^^^^^^^^^^^
 180
 181 Anything that provides introspection to the user during normal operation
 182 is an **info** message.
 183
 184 This includes all kinds of operational state transitions and events,
 185 especially if they might be interesting to the user during the course of
 186 figuring out a warning or an error.
 187
 188 By itself, these messages should mostly be statements of fact.  They might
 189 indicate the order and relationship in which things happened.  Also covered
 190 are conditions that might be "operational issues" like a link failure due
 191 to an unplugged cable.  If it's pretty much the point of running a routing
 192 daemon for, it's not a warning or an error, just business as usual.
 193
 194 The user should be able to see the state of these bits from operational
 195 state output, i.e. `show interface` or `show foobar neighbors`.  The log
 196 message indicating the change may have been printed weeks ago, but the
 197 state can always be viewed.  (If some state change has an info message but
 198 no "show" command, maybe that command needs to be added.)
 199
 200 Examples:
 201
 202 * all kinds of up/down state changes
 203
 204   * interface coming up or going down
 205   * addresses being added or deleted
 206   * peers and neighbors coming up or going down
 207
 208 * rejection of some routes due to user-configured route maps
 209 * backwards compatibility handling because another system on the network
 210   has a different or smaller feature set
 211
 212 .. note::
 213    The previously used **notify** priority is replaced with *info* in all
 214    cases.  We don't currently have a well-defined use case for it.
 215
 216
 217 Debug messages and asserts
 218 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 219
 220 Everything that is only interesting on-demand, or only while developing,
 221 is a **debug** message.  It might be interesting to the user for a
 222 particularly evasive issue, but in general these are details that an
 223 average user might not even be able to make sense of.
 224
 225 Most (or all?) debug messages should be behind a `debug foobar` category
 226 switch that controls which subset of these messages is currently
 227 interesting and thus printed.  If a debug message doesn't have such a
 228 guard, there should be a good explanation as to why.
 229
 230 Conversely, debug messages are the only thing that should be guarded by
 231 these switches.  Neither info nor warning or error messages should be
 232 hidden in this way.
 233
 234 **Asserts** should only be used as pretty crashes.  We are expecting that
 235 asserts remain enabled in production builds, but please try to not use
 236 asserts in a way that would cause a security problem if the assert wasn't
 237 there (i.e. don't use them for length checks.)
 238
 239 The purpose of asserts is mainly to help development and bug hunting.  If
 240 the daemon crashes, then having some more information is nice, and the
 241 assert can provide crucial hints that cut down on the time needed to track
 242 an issue.  That said, if the issue can be reasonably handled and/or isn't
 243 going to crash the daemon, it shouldn't be an assert.
 244
 245 For anything else where internal constraints are violated but we're not
 246 breaking due to it, it's an error instead (not a debug.)  These require
 247 "user action" of notifying the developers.
 248
 249 Examples:
 250
 251 * mismatched :code:`prev`/:code:`next` pointers in lists
 252 * some field that is absolutely needed is :code:`NULL`
 253 * any other kind of data structure corruption that will cause the daemon
 254   to crash sooner or later, one way or another
 255
 256 Thread-local buffering
 257 ----------------------
 258
 259 The core logging code in :file:`lib/zlog.c` allows setting up per-thread log
 260 message buffers in order to improve logging performance.  The following rules
 261 apply for this buffering:
 262
 263 * Only messages of priority *DEBUG* or *INFO* are buffered.
 264 * Any higher-priority message causes the thread's entire buffer to be flushed,
 265   thus message ordering is preserved on a per-thread level.
 266 * There is no guarantee on ordering between different threads;  in most cases
 267   this is arbitrary to begin with since the threads essentially race each
 268   other in printing log messages.  If an order is established with some
 269   synchronization primitive, add calls to :c:func:`zlog_tls_buffer_flush()`.
 270 * The buffers are only ever accessed by the thread they are created by.  This
 271   means no locking is necessary.
 272
 273 Both the main/default thread and additional threads created by
 274 :c:func:`frr_pthread_new()` with the default :c:func:`frr_run()` handler will
 275 initialize thread-local buffering and call :c:func:`zlog_tls_buffer_flush()`
 276 when idle.
 277
 278 If some piece of code runs for an extended period, it may be useful to insert
 279 calls to :c:func:`zlog_tls_buffer_flush()` in appropriate places:
 280
 281 .. c:function:: void zlog_tls_buffer_flush(void)
 282
 283    Write out any pending log messages that the calling thread may have in its
 284    buffer.  This function is safe to call regardless of the per-thread log
 285    buffer being set up / in use or not.
 286
 287 When working with threads that do not use the :c:type:`struct thread_master`
 288 event loop, per-thread buffers can be managed with:
 289
 290 .. c:function:: void zlog_tls_buffer_init(void)
 291
 292    Set up thread-local buffering for log messages.  This function may be
 293    called repeatedly without adverse effects, but remember to call
 294    :c:func:`zlog_tls_buffer_fini()` at thread exit.
 295
 296    .. warning::
 297
 298       If this function is called, but :c:func:`zlog_tls_buffer_flush()` is
 299       not used, log message output will lag behind since messages will only be
 300       written out when the buffer is full.
 301
 302       Exiting the thread without calling :c:func:`zlog_tls_buffer_fini()`
 303       will cause buffered log messages to be lost.
 304
 305 .. c:function:: void zlog_tls_buffer_fini(void)
 306
 307    Flush pending messages and tear down thread-local log message buffering.
 308    This function may be called repeatedly regardless of whether
 309    :c:func:`zlog_tls_buffer_init()` was ever called.
 310
 311 Log targets
 312 -----------
 313
 314 The actual logging subsystem (in :file:`lib/zlog.c`) is heavily separated
 315 from the actual log writers.  It uses an atomic linked-list (`zlog_targets`)
 316 with RCU to maintain the log targets to be called.  This list is intended to
 317 function as "backend" only, it **is not used for configuration**.
 318
 319 Logging targets provide their configuration layer on top of this and maintain
 320 their own capability to enumerate and store their configuration.  Some targets
 321 (e.g. syslog) are inherently single instance and just stuff their config in
 322 global variables.  Others (e.g. file/fd output) are multi-instance capable.
 323 There is another layer boundary here between these and the VTY configuration
 324 that they use.
 325
 326 Basic internals
 327 ^^^^^^^^^^^^^^^
 328
 329 .. c:type:: struct zlog_target
 330
 331    This struct needs to be filled in by any log target and then passed to
 332    :c:func:`zlog_target_replace()`.  After it has been registered,
 333    **RCU semantics apply**.  Most changes to associated data should make a
 334    copy, change that, and then replace the entire struct.
 335
 336    Additional per-target data should be "appended" by embedding this struct
 337    into a larger one, for use with `containerof()`, and
 338    :c:func:`zlog_target_clone()` and :c:func:`zlog_target_free()` should be
 339    used to allocate/free the entire container struct.
 340
 341    Do not use this structure to maintain configuration.  It should only
 342    contain (a copy of) the data needed to perform the actual logging.  For
 343    example, the syslog target uses this:
 344
 345    .. code-block:: c
 346
 347       struct zlt_syslog {
 348           struct zlog_target zt;
 349           int syslog_facility;
 350       };
 351
 352       static void zlog_syslog(struct zlog_target *zt, struct zlog_msg *msgs[], size_t nmsgs)
 353       {
 354           struct zlt_syslog *zte = container_of(zt, struct zlt_syslog, zt);
 355           size_t i;
 356
 357           for (i = 0; i < nmsgs; i++)
 358               if (zlog_msg_prio(msgs[i]) <= zt->prio_min)
 359                   syslog(zlog_msg_prio(msgs[i]) | zte->syslog_facility, "%s",
 360                          zlog_msg_text(msgs[i], NULL));
 361       }
 362
 363
 364 .. c:function:: struct zlog_target *zlog_target_clone(struct memtype *mt, struct zlog_target *oldzt, size_t size)
 365
 366    Allocates a logging target struct.  Note that the ``oldzt`` argument may be
 367    ``NULL`` to allocate a "from scratch".  If ``oldzt`` is not ``NULL``, the
 368    generic bits in :c:type:`struct zlog_target` are copied.  **Target specific
 369    bits are not copied.**
 370
 371 .. c:function:: struct zlog_target *zlog_target_replace(struct zlog_target *oldzt, struct zlog_target *newzt)
 372
 373    Adds, replaces or deletes a logging target (either ``oldzt`` or ``newzt`` may be ``NULL``.)
 374
 375    Returns ``oldzt`` for freeing.  The target remains possibly in use by
 376    other threads until the RCU cycle ends.  This implies you cannot release
 377    resources (e.g. memory, file descriptors) immediately.
 378
 379    The replace operation is not atomic; for a brief period it is possible that
 380    messages are delivered on both ``oldzt`` and ``newzt``.
 381
 382    .. warning::
 383
 384       ``oldzt`` must remain **functional** until the RCU cycle ends.
 385
 386 .. c:function:: void zlog_target_free(struct memtype *mt, struct zlog_target *zt)
 387
 388    Counterpart to :c:func:`zlog_target_clone()`, frees a target (using RCU.)
 389
 390 .. c:member:: void (*zlog_target.logfn)(struct zlog_target *zt, struct zlog_msg *msgs[], size_t nmsg)
 391
 392    Called on a target to deliver "normal" logging messages.  ``msgs`` is an
 393    array of opaque structs containing the actual message.  Use ``zlog_msg_*``
 394    functions to access message data (this is done to allow some optimizations,
 395    e.g.  lazy formatting the message text and timestamp as needed.)
 396
 397    .. note::
 398
 399       ``logfn()`` must check each individual message's priority value against
 400       the configured ``prio_min``.  While the ``prio_min`` field is common to
 401       all targets and used by the core logging code to early-drop unneeded log
 402       messages, the array is **not** filtered for each ``logfn()`` call.
 403
 404 .. c:member:: void (*zlog_target.logfn_sigsafe)(struct zlog_target *zt, const char *text, size_t len)
 405
 406    Called to deliver "exception" logging messages (i.e. SEGV messages.)
 407    Must be Async-Signal-Safe (may not allocate memory or call "complicated"
 408    libc functions.)  May be ``NULL`` if the log target cannot handle this.
 409
 410 Standard targets
 411 ^^^^^^^^^^^^^^^^
 412
 413 :file:`lib/zlog_targets.c` provides the standard file / fd / syslog targets.
 414 The syslog target is single-instance while file / fd targets can be
 415 instantiated as needed.  There are 3 built-in targets that are fully
 416 autonomous without any config:
 417
 418 - startup logging to `stderr`, until either :c:func:`zlog_startup_end()` or
 419   :c:func:`zlog_aux_init()` is called.
 420 - stdout logging for non-daemon programs using :c:func:`zlog_aux_init()`
 421 - crashlogs written to :file:`/var/tmp/frr.daemon.crashlog`
 422
 423 The regular CLI/command-line logging setup is handled by :file:`lib/log_vty.c`
 424 which makes the appropriate instantiations of syslog / file / fd targets.
 425
 426 .. todo::
 427
 428   :c:func:`zlog_startup_end()` should do an explicit switchover from
 429   startup stderr logging to configured logging.  Currently, configured logging
 430   starts in parallel as soon as the respective setup is executed.  This results
 431   in some duplicate logging.