doc/developer/logging.rst

   1 .. _logging:
   2
   3 Logging
   4 =======
   5
   6 One of the most frequent decisions to make while writing code for FRR is what
   7 to log, what level to log it at, and when to log it.  Here is a list of
   8 recommendations for these decisions.
   9
  10
  11 printfrr()
  12 ----------
  13
  14 ``printfrr()`` is FRR's modified version of ``printf()``, designed to make
  15 life easier when printing nontrivial datastructures.  The following variants
  16 are available:
  17
  18 .. c:function:: ssize_t snprintfrr(char *buf, size_t len, const char *fmt, ...)
  19 .. c:function:: ssize_t vsnprintfrr(char *buf, size_t len, const char *fmt, va_list)
  20
  21    These correspond to ``snprintf``/``vsnprintf``.  If you pass NULL for buf
  22    or 0 for len, no output is written but the return value is still calculated.
  23
  24    The return value is always the full length of the output, unconstrained by
  25    `len`.  It does **not** include the terminating ``\0`` character.  A
  26    malformed format string can result in a ``-1`` return value.
  27
  28 .. c:function:: ssize_t csnprintfrr(char *buf, size_t len, const char *fmt, ...)
  29 .. c:function:: ssize_t vcsnprintfrr(char *buf, size_t len, const char *fmt, va_list)
  30
  31    Same as above, but the ``c`` stands for "continue" or "concatenate".  The
  32    output is appended to the string instead of overwriting it.
  33
  34 .. c:function:: char *asprintfrr(struct memtype *mt, const char *fmt, ...)
  35 .. c:function:: char *vasprintfrr(struct memtype *mt, const char *fmt, va_list)
  36
  37    These functions allocate a dynamic buffer (using MTYPE `mt`) and print to
  38    that.  If the format string is malformed, they return a copy of the format
  39    string, so the return value is always non-NULL and always dynamically
  40    allocated with `mt`.
  41
  42 .. c:function:: char *asnprintfrr(struct memtype *mt, char *buf, size_t len, const char *fmt, ...)
  43 .. c:function:: char *vasnprintfrr(struct memtype *mt, char *buf, size_t len, const char *fmt, va_list)
  44
  45    This variant tries to use the static buffer provided, but falls back to
  46    dynamic allocation if it is insufficient.
  47
  48    The return value can be either `buf` or a newly allocated string using
  49    `mt`.  You MUST free it like this::
  50
  51       char *ret = asnprintfrr(MTYPE_FOO, buf, sizeof(buf), ...);
  52       if (ret != buf)
  53          XFREE(MTYPE_FOO, ret);
  54
  55 Extensions
  56 ^^^^^^^^^^
  57
  58 ``printfrr()`` format strings can be extended with suffixes after `%p` or
  59 `%d`.  The following extended format specifiers are available:
  60
  61 +-----------+--------------------------+----------------------------------------------+
  62 | Specifier | Argument                 | Output                                       |
  63 +===========+==========================+==============================================+
  64 | ``%Lu``   | ``uint64_t``             | ``12345``                                    |
  65 +-----------+--------------------------+----------------------------------------------+
  66 | ``%Ld``   | ``int64_t``              | ``-12345``                                   |
  67 +-----------+--------------------------+----------------------------------------------+
  68 | ``%pI4``  | ``struct in_addr *``     | ``1.2.3.4``                                  |
  69 |           |                          |                                              |
  70 |           | ``in_addr_t *``          |                                              |
  71 +-----------+--------------------------+----------------------------------------------+
  72 | ``%pI6``  | ``struct in6_addr *``    | ``fe80::1234``                               |
  73 +-----------+--------------------------+----------------------------------------------+
  74 | ``%pFX``  | ``struct prefix *``      | ``fe80::1234/64``                            |
  75 +-----------+--------------------------+----------------------------------------------+
  76 | ``%pSG4`` | ``struct prefix_sg *``   | ``(*,1.2.3.4)``                              |
  77 +-----------+--------------------------+----------------------------------------------+
  78 | ``%pRN``  | ``struct route_node *``  | ``192.168.1.0/24`` (dst-only node)           |
  79 |           |                          |                                              |
  80 |           |                          | ``2001:db8::/32 from fe80::/64`` (SADR node) |
  81 +-----------+--------------------------+----------------------------------------------+
  82 | ``%pNHv`` | ``struct nexthop *``     | ``1.2.3.4, via eth0``                        |
  83 +-----------+--------------------------+----------------------------------------------+
  84 | ``%pNHs`` | ``struct nexthop *``     | ``1.2.3.4 if 15``                            |
  85 +-----------+--------------------------+----------------------------------------------+
  86 | ``%pFX``  | ``struct bgp_dest *``    | ``fe80::1234/64`` (available in BGP only)    |
  87 +-----------+--------------------------+----------------------------------------------+
  88
  89 Printf features like field lengths can be used normally with these extensions,
  90 e.g. ``%-15pI4`` works correctly.
  91
  92 The extension specifier after ``%p`` or ``%d`` is always an uppercase letter;
  93 by means of established pattern uppercase letters and numbers form the type
  94 identifier which may be followed by lowercase flags.
  95
  96 You can grep the FRR source for ``printfrr_ext_autoreg`` to see all extended
  97 printers and what exactly they do.  More printers are likely to be added as
  98 needed/useful, so the list above may become outdated.
  99
 100 ``%Ld`` is not an "extension" for printfrr; it's wired directly into the main
 101 printf logic.
 102
 103 .. note::
 104
 105    The ``zlog_*``/``flog_*`` and ``vty_out`` functions all use printfrr
 106    internally, so these extensions are available there.  However, they are
 107    **not** available when calling ``snprintf`` directly.  You need to call
 108    ``snprintfrr`` instead.
 109
 110 AS-Safety
 111 ^^^^^^^^^
 112
 113 ``printfrr()`` are AS-Safe under the following conditions:
 114
 115 * the ``[v]as[n]printfrr`` variants are not AS-Safe (allocating memory)
 116 * floating point specifiers are not AS-Safe (system printf is used for these)
 117 * the positional ``%1$d`` syntax should not be used (8 arguments are supported
 118   while AS-Safe)
 119 * extensions are only AS-Safe if their printer is AS-Safe
 120
 121 Log levels
 122 ----------
 123
 124 Errors and warnings
 125 ^^^^^^^^^^^^^^^^^^^
 126
 127 If it is something that the user will want to look at and maybe do
 128 something, it is either an **error** or a **warning**.
 129
 130 We're expecting that warnings and errors are in some way visible to the
 131 user (in the worst case by looking at the log after the network broke, but
 132 maybe by a syslog collector from all routers.)  Therefore, anything that
 133 needs to get the user in the loop—and only these things—are warnings or
 134 errors.
 135
 136 Note that this doesn't necessarily mean the user needs to fix something in
 137 the FRR instance.  It also includes when we detect something else needs
 138 fixing, for example another router, the system we're running on, or the
 139 configuration.  The common point is that the user should probably do
 140 *something*.
 141
 142 Deciding between a warning and an error is slightly less obvious; the rule
 143 of thumb here is that an error will cause considerable fallout beyond its
 144 direct effect.  Closing a BGP session due to a malformed update is an error
 145 since all routes from the peer are dropped; discarding one route because
 146 its attributes don't make sense is a warning.
 147
 148 This also loosely corresponds to the kind of reaction we're expecting from
 149 the user.  An error is likely to need immediate response while a warning
 150 might be snoozed for a bit and addressed as part of general maintenance.
 151 If a problem will self-repair (e.g. by retransmits), it should be a
 152 warning—unless the impact until that self-repair is very harsh.
 153
 154 Examples for warnings:
 155
 156 * a BGP update, LSA or LSP could not be processed, but operation is
 157   proceeding and the broken pieces are likely to self-fix later
 158 * some kind of controller cannot be reached, but we can work without it
 159 * another router is using some unknown or unsupported capability
 160
 161 Examples for errors:
 162
 163 * dropping a BGP session due to malformed data
 164 * a socket for routing protocol operation cannot be opened
 165 * desynchronization from network state because something went wrong
 166 * *everything that we as developers would really like to be notified about,
 167   i.e. some assumption in the code isn't holding up*
 168
 169
 170 Informational messages
 171 ^^^^^^^^^^^^^^^^^^^^^^
 172
 173 Anything that provides introspection to the user during normal operation
 174 is an **info** message.
 175
 176 This includes all kinds of operational state transitions and events,
 177 especially if they might be interesting to the user during the course of
 178 figuring out a warning or an error.
 179
 180 By itself, these messages should mostly be statements of fact.  They might
 181 indicate the order and relationship in which things happened.  Also covered
 182 are conditions that might be "operational issues" like a link failure due
 183 to an unplugged cable.  If it's pretty much the point of running a routing
 184 daemon for, it's not a warning or an error, just business as usual.
 185
 186 The user should be able to see the state of these bits from operational
 187 state output, i.e. `show interface` or `show foobar neighbors`.  The log
 188 message indicating the change may have been printed weeks ago, but the
 189 state can always be viewed.  (If some state change has an info message but
 190 no "show" command, maybe that command needs to be added.)
 191
 192 Examples:
 193
 194 * all kinds of up/down state changes
 195
 196   * interface coming up or going down
 197   * addresses being added or deleted
 198   * peers and neighbors coming up or going down
 199
 200 * rejection of some routes due to user-configured route maps
 201 * backwards compatibility handling because another system on the network
 202   has a different or smaller feature set
 203
 204 .. note::
 205    The previously used **notify** priority is replaced with *info* in all
 206    cases.  We don't currently have a well-defined use case for it.
 207
 208
 209 Debug messages and asserts
 210 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 211
 212 Everything that is only interesting on-demand, or only while developing,
 213 is a **debug** message.  It might be interesting to the user for a
 214 particularly evasive issue, but in general these are details that an
 215 average user might not even be able to make sense of.
 216
 217 Most (or all?) debug messages should be behind a `debug foobar` category
 218 switch that controls which subset of these messages is currently
 219 interesting and thus printed.  If a debug message doesn't have such a
 220 guard, there should be a good explanation as to why.
 221
 222 Conversely, debug messages are the only thing that should be guarded by
 223 these switches.  Neither info nor warning or error messages should be
 224 hidden in this way.
 225
 226 **Asserts** should only be used as pretty crashes.  We are expecting that
 227 asserts remain enabled in production builds, but please try to not use
 228 asserts in a way that would cause a security problem if the assert wasn't
 229 there (i.e. don't use them for length checks.)
 230
 231 The purpose of asserts is mainly to help development and bug hunting.  If
 232 the daemon crashes, then having some more information is nice, and the
 233 assert can provide crucial hints that cut down on the time needed to track
 234 an issue.  That said, if the issue can be reasonably handled and/or isn't
 235 going to crash the daemon, it shouldn't be an assert.
 236
 237 For anything else where internal constraints are violated but we're not
 238 breaking due to it, it's an error instead (not a debug.)  These require
 239 "user action" of notifying the developers.
 240
 241 Examples:
 242
 243 * mismatched :code:`prev`/:code:`next` pointers in lists
 244 * some field that is absolutely needed is :code:`NULL`
 245 * any other kind of data structure corruption that will cause the daemon
 246   to crash sooner or later, one way or another
 247
 248 Thread-local buffering
 249 ----------------------
 250
 251 The core logging code in :file:`lib/zlog.c` allows setting up per-thread log
 252 message buffers in order to improve logging performance.  The following rules
 253 apply for this buffering:
 254
 255 * Only messages of priority *DEBUG* or *INFO* are buffered.
 256 * Any higher-priority message causes the thread's entire buffer to be flushed,
 257   thus message ordering is preserved on a per-thread level.
 258 * There is no guarantee on ordering between different threads;  in most cases
 259   this is arbitrary to begin with since the threads essentially race each
 260   other in printing log messages.  If an order is established with some
 261   synchronization primitive, add calls to :c:func:`zlog_tls_buffer_flush()`.
 262 * The buffers are only ever accessed by the thread they are created by.  This
 263   means no locking is necessary.
 264
 265 Both the main/default thread and additional threads created by
 266 :c:func:`frr_pthread_new()` with the default :c:func:`frr_run()` handler will
 267 initialize thread-local buffering and call :c:func:`zlog_tls_buffer_flush()`
 268 when idle.
 269
 270 If some piece of code runs for an extended period, it may be useful to insert
 271 calls to :c:func:`zlog_tls_buffer_flush()` in appropriate places:
 272
 273 .. c:function:: void zlog_tls_buffer_flush(void)
 274
 275    Write out any pending log messages that the calling thread may have in its
 276    buffer.  This function is safe to call regardless of the per-thread log
 277    buffer being set up / in use or not.
 278
 279 When working with threads that do not use the :c:type:`struct thread_master`
 280 event loop, per-thread buffers can be managed with:
 281
 282 .. c:function:: void zlog_tls_buffer_init(void)
 283
 284    Set up thread-local buffering for log messages.  This function may be
 285    called repeatedly without adverse effects, but remember to call
 286    :c:func:`zlog_tls_buffer_fini()` at thread exit.
 287
 288    .. warning::
 289
 290       If this function is called, but :c:func:`zlog_tls_buffer_flush()` is
 291       not used, log message output will lag behind since messages will only be
 292       written out when the buffer is full.
 293
 294       Exiting the thread without calling :c:func:`zlog_tls_buffer_fini()`
 295       will cause buffered log messages to be lost.
 296
 297 .. c:function:: void zlog_tls_buffer_fini(void)
 298
 299    Flush pending messages and tear down thread-local log message buffering.
 300    This function may be called repeatedly regardless of whether
 301    :c:func:`zlog_tls_buffer_init()` was ever called.
 302
 303 Log targets
 304 -----------
 305
 306 The actual logging subsystem (in :file:`lib/zlog.c`) is heavily separated
 307 from the actual log writers.  It uses an atomic linked-list (`zlog_targets`)
 308 with RCU to maintain the log targets to be called.  This list is intended to
 309 function as "backend" only, it **is not used for configuration**.
 310
 311 Logging targets provide their configuration layer on top of this and maintain
 312 their own capability to enumerate and store their configuration.  Some targets
 313 (e.g. syslog) are inherently single instance and just stuff their config in
 314 global variables.  Others (e.g. file/fd output) are multi-instance capable.
 315 There is another layer boundary here between these and the VTY configuration
 316 that they use.
 317
 318 Basic internals
 319 ^^^^^^^^^^^^^^^
 320
 321 .. c:type:: struct zlog_target
 322
 323    This struct needs to be filled in by any log target and then passed to
 324    :c:func:`zlog_target_replace()`.  After it has been registered,
 325    **RCU semantics apply**.  Most changes to associated data should make a
 326    copy, change that, and then replace the entire struct.
 327
 328    Additional per-target data should be "appended" by embedding this struct
 329    into a larger one, for use with `containerof()`, and
 330    :c:func:`zlog_target_clone()` and :c:func:`zlog_target_free()` should be
 331    used to allocate/free the entire container struct.
 332
 333    Do not use this structure to maintain configuration.  It should only
 334    contain (a copy of) the data needed to perform the actual logging.  For
 335    example, the syslog target uses this:
 336
 337    .. code-block:: c
 338
 339       struct zlt_syslog {
 340           struct zlog_target zt;
 341           int syslog_facility;
 342       };
 343
 344       static void zlog_syslog(struct zlog_target *zt, struct zlog_msg *msgs[], size_t nmsgs)
 345       {
 346           struct zlt_syslog *zte = container_of(zt, struct zlt_syslog, zt);
 347           size_t i;
 348
 349           for (i = 0; i < nmsgs; i++)
 350               if (zlog_msg_prio(msgs[i]) <= zt->prio_min)
 351                   syslog(zlog_msg_prio(msgs[i]) | zte->syslog_facility, "%s",
 352                          zlog_msg_text(msgs[i], NULL));
 353       }
 354
 355
 356 .. c:function:: struct zlog_target *zlog_target_clone(struct memtype *mt, struct zlog_target *oldzt, size_t size)
 357
 358    Allocates a logging target struct.  Note that the ``oldzt`` argument may be
 359    ``NULL`` to allocate a "from scratch".  If ``oldzt`` is not ``NULL``, the
 360    generic bits in :c:type:`struct zlog_target` are copied.  **Target specific
 361    bits are not copied.**
 362
 363 .. c:function:: struct zlog_target *zlog_target_replace(struct zlog_target *oldzt, struct zlog_target *newzt)
 364
 365    Adds, replaces or deletes a logging target (either ``oldzt`` or ``newzt`` may be ``NULL``.)
 366
 367    Returns ``oldzt`` for freeing.  The target remains possibly in use by
 368    other threads until the RCU cycle ends.  This implies you cannot release
 369    resources (e.g. memory, file descriptors) immediately.
 370
 371    The replace operation is not atomic; for a brief period it is possible that
 372    messages are delivered on both ``oldzt`` and ``newzt``.
 373
 374    .. warning::
 375
 376       ``oldzt`` must remain **functional** until the RCU cycle ends.
 377
 378 .. c:function:: void zlog_target_free(struct memtype *mt, struct zlog_target *zt)
 379
 380    Counterpart to :c:func:`zlog_target_clone()`, frees a target (using RCU.)
 381
 382 .. c:member:: void (*zlog_target.logfn)(struct zlog_target *zt, struct zlog_msg *msgs[], size_t nmsg)
 383
 384    Called on a target to deliver "normal" logging messages.  ``msgs`` is an
 385    array of opaque structs containing the actual message.  Use ``zlog_msg_*``
 386    functions to access message data (this is done to allow some optimizations,
 387    e.g.  lazy formatting the message text and timestamp as needed.)
 388
 389    .. note::
 390
 391       ``logfn()`` must check each individual message's priority value against
 392       the configured ``prio_min``.  While the ``prio_min`` field is common to
 393       all targets and used by the core logging code to early-drop unneeded log
 394       messages, the array is **not** filtered for each ``logfn()`` call.
 395
 396 .. c:member:: void (*zlog_target.logfn_sigsafe)(struct zlog_target *zt, const char *text, size_t len)
 397
 398    Called to deliver "exception" logging messages (i.e. SEGV messages.)
 399    Must be Async-Signal-Safe (may not allocate memory or call "complicated"
 400    libc functions.)  May be ``NULL`` if the log target cannot handle this.
 401
 402 Standard targets
 403 ^^^^^^^^^^^^^^^^
 404
 405 :file:`lib/zlog_targets.c` provides the standard file / fd / syslog targets.
 406 The syslog target is single-instance while file / fd targets can be
 407 instantiated as needed.  There are 3 built-in targets that are fully
 408 autonomous without any config:
 409
 410 - startup logging to `stderr`, until either :c:func:`zlog_startup_end()` or
 411   :c:func:`zlog_aux_init()` is called.
 412 - stdout logging for non-daemon programs using :c:func:`zlog_aux_init()`
 413 - crashlogs written to :file:`/var/tmp/frr.daemon.crashlog`
 414
 415 The regular CLI/command-line logging setup is handled by :file:`lib/log_vty.c`
 416 which makes the appropriate instantiations of syslog / file / fd targets.
 417
 418 .. todo::
 419
 420   :c:func:`zlog_startup_end()` should do an explicit switchover from
 421   startup stderr logging to configured logging.  Currently, configured logging
 422   starts in parallel as soon as the respective setup is executed.  This results
 423   in some duplicate logging.