doc/developer: logging guide

author David Lamparter <equinox@opensourcerouting.org>

Wed, 22 Aug 2018 15:39:12 +0000 (17:39 +0200)

committer David Lamparter <equinox@opensourcerouting.org>

Wed, 22 Aug 2018 15:39:34 +0000 (17:39 +0200)
author David Lamparter <equinox@opensourcerouting.org>
Wed, 22 Aug 2018 15:39:12 +0000 (17:39 +0200)
committer David Lamparter <equinox@opensourcerouting.org>
Wed, 22 Aug 2018 15:39:34 +0000 (17:39 +0200)
diff --git a/doc/Makefile.am b/doc/Makefile.am

index 6f72585485470fddf026dc497912baaf4fd84bd6..f4218650545f93429e57dad8dc38bc02e0869d65 100644 (file)
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@@ -193,6 +193,7 @@ EXTRA_DIST = frr-sphinx.mk \
         developer/index.rst \
         developer/ldpd-basic-test-setup.md \
         developer/library.rst \
+       developer/logging.rst \
         developer/Makefile.in \
         developer/maintainer-release-build.rst \
         developer/memtypes.rst \
diff --git a/doc/developer/library.rst b/doc/developer/library.rst

index f6efa33051d81ed47b5c2a25f7730e4733a3e763..77b2f229b785a02e7fb2c871dbec5773df312b9d 100644 (file)
--- a/doc/developer/library.rst
+++ b/doc/developer/library.rst
@@ -7,6 +7,7 @@ Library Facilities (libfrr)
  .. toctree::
     :maxdepth: 2
  
+   logging
     memtypes
     hooks
     cli
diff --git a/doc/developer/logging.rst b/doc/developer/logging.rst

new file mode 100644 (file)

index 0000000..4338d90
--- /dev/null
+++ b/doc/developer/logging.rst
@@ -0,0 +1,131 @@
+Developer's Guide to Logging
+============================
+
+One of the most frequent decisions to make while writing code for FRR is what
+to log, what level to log it at, and when to log it.  Here is a list of
+recommendations for these decisions.
+
+
+Errors and warnings
+-------------------
+
+If it is something that the user will want to look at and maybe do
+something, it is either an **error** or a **warning**.
+
+We're expecting that warnings and errors are in some way visible to the
+user (in the worst case by looking at the log after the network broke, but
+maybe by a syslog collector from all routers.)  Therefore, anything that
+needs to get the user in the loop—and only these things—are warnings or
+errors.
+
+Note that this doesn't neccessarily mean the user needs to fix something in
+the FRR instance.  It also includes when we detect something else needs
+fixing, for example another router, the system we're running on, or the
+configuration.  The common point is that the user should probably do
+*something*.
+
+Deciding between a warning and an error is slightly less obvious; the rule
+of thumb here is that an error will cause considerable fallout beyond its
+direct effect.  Closing a BGP session due to a malformed update is an error
+since all routes from the peer are dropped; discarding one route because
+its attributes don't make sense is a warning.
+
+This also loosely corresponds to the kind of reaction we're expecting from
+the user.  An error is likely to need immediate response while a warning
+might be snoozed for a bit and addressed as part of general maintenance.
+If a problem will self-repair (e.g. by retransmits), it should be a
+warning—unless the impact until that self-repair is very harsh.
+
+Examples for warnings:
+
+* a BGP update, LSA or LSP could not be processed, but operation is
+  proceeding and the broken pieces are likely to self-fix later
+* some kind of controller cannot be reached, but we can work without it
+* another router is using some unknown or unsupported capability
+
+Examples for errors:
+
+* dropping a BGP session due to malformed data
+* a socket for routing protocol operation cannot be opened
+* desynchronization from network state because something went wrong
+* *everything that we as developers would really like to be notified about,
+  i.e. some assumption in the code isn't holding up*
+
+
+Informational messages
+----------------------
+
+Anything that provides introspection to the user during normal operation
+is an **info** message.
+
+This includes all kinds of operational state transitions and events,
+especially if they might be interesting to the user during the course of
+figuring out a warning or an error.
+
+By itself, these messages should mostly be statements of fact.  They might
+indicate the order and relationship in which things happened.  Also covered
+are conditions that might be "operational issues" like a link failure due
+to an unplugged cable.  If it's pretty much the point of running a routing
+daemon for, it's not a warning or an error, just business as usual.
+
+The user should be able to see the state of these bits from operational
+state output, i.e. `show interface` or `show foobar neighbors`.  The log
+message indicating the change may have been printed weeks ago, but the
+state can always be viewed.  (If some state change has an info message but
+no "show" command, maybe that command needs to be added.)
+
+Examples:
+
+* all kinds of up/down state changes
+
+  * interface coming up or going down
+  * addresses being added or deleted
+  * peers and neighbors coming up or going down
+
+* rejection of some routes due to user-configured route maps
+* backwards compatibility handling because another system on the network
+  has a different or smaller feature set
+
+.. note::
+   The previously used **notify** priority is replaced with *info* in all
+   cases.  We don't currently have a well-defined use case for it.
+
+
+Debug messages and asserts
+--------------------------
+
+Everything that is only interesting on-demand, or only while developing,
+is a **debug** message.  It might be interesting to the user for a
+particularly evasive issue, but in general these are details that an
+average user might not even be able to make sense of.
+
+Most (or all?) debug messages should be behind a `debug foobar` category
+switch that controls which subset of these messages is currently
+interesting and thus printed.  If a debug message doesn't have such a
+guard, there should be a good explanation as to why.
+
+Conversely, debug messages are the only thing that should be guarded by
+these switches.  Neither info nor warning or error messages should be
+hidden in this way.
+
+**Asserts** should only be used as pretty crashes.  We are expecting that
+asserts remain enabled in production builds, but please try to not use
+asserts in a way that would cause a security problem if the assert wasn't
+there (i.e. don't use them for length checks.)
+
+The purpose of asserts is mainly to help development and bug hunting.  If
+the daemon crashes, then having some more information is nice, and the
+assert can provide crucial hints that cut down on the time needed to track
+an issue.  That said, if the issue can be reasonably handled and/or isn't
+going to crash the daemon, it shouldn't be an assert.
+
+For anything else where internal constraints are violated but we're not
+breaking due to it, it's an error instead (not a debug.)  These require
+"user action" of notifying the developers.
+
+Examples:
+
+* mismatched :code:`prev`/:code:`next` pointers in lists
+* some field that is absolutely needed is :code:`NULL`
+* any other kind of data structure corruption that will cause the daemon
+  to crash sooner or later, one way or another
author	David Lamparter <equinox@opensourcerouting.org>
	Wed, 22 Aug 2018 15:39:12 +0000 (17:39 +0200)
committer	David Lamparter <equinox@opensourcerouting.org>
	Wed, 22 Aug 2018 15:39:34 +0000 (17:39 +0200)
doc/Makefile.am		patch \| blob \| blame \| history
doc/developer/library.rst		patch \| blob \| blame \| history
doc/developer/logging.rst	[new file with mode: 0644]	patch \| blob