]>
Commit | Line | Data |
---|---|---|
3e5d2f71 DL |
1 | Developer's Guide to Logging |
2 | ============================ | |
3 | ||
4 | One of the most frequent decisions to make while writing code for FRR is what | |
5 | to log, what level to log it at, and when to log it. Here is a list of | |
6 | recommendations for these decisions. | |
7 | ||
8 | ||
2a37d6c5 DL |
9 | printfrr() |
10 | ---------- | |
11 | ||
12 | ``printfrr()`` is FRR's modified version of ``printf()``, designed to make | |
13 | life easier when printing nontrivial datastructures. The following variants | |
14 | are available: | |
15 | ||
16 | .. c:function:: ssize_t snprintfrr(char *buf, size_t len, const char *fmt, ...) | |
17 | .. c:function:: ssize_t vsnprintfrr(char *buf, size_t len, const char *fmt, va_list) | |
18 | ||
19 | These correspond to ``snprintf``/``vsnprintf``. If you pass NULL for buf | |
20 | or 0 for len, no output is written but the return value is still calculated. | |
21 | ||
22 | The return value is always the full length of the output, unconstrained by | |
23 | `len`. It does **not** include the terminating ``\0`` character. A | |
24 | malformed format string can result in a ``-1`` return value. | |
25 | ||
26 | .. c:function:: ssize_t csnprintfrr(char *buf, size_t len, const char *fmt, ...) | |
27 | .. c:function:: ssize_t vcsnprintfrr(char *buf, size_t len, const char *fmt, va_list) | |
28 | ||
29 | Same as above, but the ``c`` stands for "continue" or "concatenate". The | |
30 | output is appended to the string instead of overwriting it. | |
31 | ||
32 | .. c:function:: char *asprintfrr(struct memtype *mt, const char *fmt, ...) | |
33 | .. c:function:: char *vasprintfrr(struct memtype *mt, const char *fmt, va_list) | |
34 | ||
35 | These functions allocate a dynamic buffer (using MTYPE `mt`) and print to | |
36 | that. If the format string is malformed, they return a copy of the format | |
37 | string, so the return value is always non-NULL and always dynamically | |
38 | allocated with `mt`. | |
39 | ||
40 | .. c:function:: char *asnprintfrr(struct memtype *mt, char *buf, size_t len, const char *fmt, ...) | |
41 | .. c:function:: char *vasnprintfrr(struct memtype *mt, char *buf, size_t len, const char *fmt, va_list) | |
42 | ||
43 | This variant tries to use the static buffer provided, but falls back to | |
44 | dynamic allocation if it is insufficient. | |
45 | ||
46 | The return value can be either `buf` or a newly allocated string using | |
47 | `mt`. You MUST free it like this:: | |
48 | ||
49 | char *ret = asnprintfrr(MTYPE_FOO, buf, sizeof(buf), ...); | |
50 | if (ret != buf) | |
51 | XFREE(MTYPE_FOO, ret); | |
52 | ||
53 | Extensions | |
54 | ^^^^^^^^^^ | |
55 | ||
56 | ``printfrr()`` format strings can be extended with suffixes after `%p` or | |
57 | `%d`. The following extended format specifiers are available: | |
58 | ||
59 | +-----------+--------------------------+----------------------------------------------+ | |
60 | | Specifier | Argument | Output | | |
61 | +===========+==========================+==============================================+ | |
62 | | ``%Lu`` | ``uint64_t`` | ``12345`` | | |
63 | +-----------+--------------------------+----------------------------------------------+ | |
64 | | ``%Ld`` | ``int64_t`` | ``-12345`` | | |
65 | +-----------+--------------------------+----------------------------------------------+ | |
66 | | ``%pI4`` | ``struct in_addr *`` | ``1.2.3.4`` | | |
67 | | | | | | |
68 | | | ``in_addr_t *`` | | | |
69 | +-----------+--------------------------+----------------------------------------------+ | |
70 | | ``%pI6`` | ``struct in6_addr *`` | ``fe80::1234`` | | |
71 | +-----------+--------------------------+----------------------------------------------+ | |
72 | | ``%pFX`` | ``struct prefix *`` | ``fe80::1234/64`` | | |
73 | +-----------+--------------------------+----------------------------------------------+ | |
74 | | ``%pSG4`` | ``struct prefix_sg *`` | ``(*,1.2.3.4)`` | | |
75 | +-----------+--------------------------+----------------------------------------------+ | |
76 | | ``%pRN`` | ``struct route_node *`` | ``192.168.1.0/24`` (dst-only node) | | |
77 | | | | | | |
78 | | | | ``2001:db8::/32 from fe80::/64`` (SADR node) | | |
79 | +-----------+--------------------------+----------------------------------------------+ | |
80 | | ``%pNHv`` | ``struct nexthop *`` | ``1.2.3.4, via eth0`` | | |
81 | +-----------+--------------------------+----------------------------------------------+ | |
82 | | ``%pNHs`` | ``struct nexthop *`` | ``1.2.3.4 if 15`` | | |
83 | +-----------+--------------------------+----------------------------------------------+ | |
84 | ||
85 | Printf features like field lengths can be used normally with these extensions, | |
86 | e.g. ``%-15pI4`` works correctly. | |
87 | ||
88 | The extension specifier after ``%p`` or ``%d`` is always an uppercase letter; | |
89 | by means of established pattern uppercase letters and numbers form the type | |
90 | identifier which may be followed by lowercase flags. | |
91 | ||
92 | You can grep the FRR source for ``printfrr_ext_autoreg`` to see all extended | |
93 | printers and what exactly they do. More printers are likely to be added as | |
94 | needed/useful, so the list above may become outdated. | |
95 | ||
96 | ``%Ld`` is not an "extension" for printfrr; it's wired directly into the main | |
97 | printf logic. | |
98 | ||
99 | .. note:: | |
100 | ||
101 | The ``zlog_*``/``flog_*`` and ``vty_out`` functions all use printfrr | |
102 | internally, so these extensions are available there. However, they are | |
103 | **not** available when calling ``snprintf`` directly. You need to call | |
104 | ``snprintfrr`` instead. | |
105 | ||
106 | AS-Safety | |
107 | ^^^^^^^^^ | |
108 | ||
109 | ``printfrr()`` are AS-Safe under the following conditions: | |
110 | ||
111 | * the ``[v]as[n]printfrr`` variants are not AS-Safe (allocating memory) | |
112 | * floating point specifiers are not AS-Safe (system printf is used for these) | |
113 | * the positional ``%1$d`` syntax should not be used (8 arguments are supported | |
114 | while AS-Safe) | |
115 | * extensions are only AS-Safe if their printer is AS-Safe | |
116 | ||
3e5d2f71 DL |
117 | Errors and warnings |
118 | ------------------- | |
119 | ||
120 | If it is something that the user will want to look at and maybe do | |
121 | something, it is either an **error** or a **warning**. | |
122 | ||
123 | We're expecting that warnings and errors are in some way visible to the | |
124 | user (in the worst case by looking at the log after the network broke, but | |
125 | maybe by a syslog collector from all routers.) Therefore, anything that | |
126 | needs to get the user in the loop—and only these things—are warnings or | |
127 | errors. | |
128 | ||
56f0bea7 | 129 | Note that this doesn't necessarily mean the user needs to fix something in |
3e5d2f71 DL |
130 | the FRR instance. It also includes when we detect something else needs |
131 | fixing, for example another router, the system we're running on, or the | |
132 | configuration. The common point is that the user should probably do | |
133 | *something*. | |
134 | ||
135 | Deciding between a warning and an error is slightly less obvious; the rule | |
136 | of thumb here is that an error will cause considerable fallout beyond its | |
137 | direct effect. Closing a BGP session due to a malformed update is an error | |
138 | since all routes from the peer are dropped; discarding one route because | |
139 | its attributes don't make sense is a warning. | |
140 | ||
141 | This also loosely corresponds to the kind of reaction we're expecting from | |
142 | the user. An error is likely to need immediate response while a warning | |
143 | might be snoozed for a bit and addressed as part of general maintenance. | |
144 | If a problem will self-repair (e.g. by retransmits), it should be a | |
145 | warning—unless the impact until that self-repair is very harsh. | |
146 | ||
147 | Examples for warnings: | |
148 | ||
149 | * a BGP update, LSA or LSP could not be processed, but operation is | |
150 | proceeding and the broken pieces are likely to self-fix later | |
151 | * some kind of controller cannot be reached, but we can work without it | |
152 | * another router is using some unknown or unsupported capability | |
153 | ||
154 | Examples for errors: | |
155 | ||
156 | * dropping a BGP session due to malformed data | |
157 | * a socket for routing protocol operation cannot be opened | |
158 | * desynchronization from network state because something went wrong | |
159 | * *everything that we as developers would really like to be notified about, | |
160 | i.e. some assumption in the code isn't holding up* | |
161 | ||
162 | ||
163 | Informational messages | |
164 | ---------------------- | |
165 | ||
166 | Anything that provides introspection to the user during normal operation | |
167 | is an **info** message. | |
168 | ||
169 | This includes all kinds of operational state transitions and events, | |
170 | especially if they might be interesting to the user during the course of | |
171 | figuring out a warning or an error. | |
172 | ||
173 | By itself, these messages should mostly be statements of fact. They might | |
174 | indicate the order and relationship in which things happened. Also covered | |
175 | are conditions that might be "operational issues" like a link failure due | |
176 | to an unplugged cable. If it's pretty much the point of running a routing | |
177 | daemon for, it's not a warning or an error, just business as usual. | |
178 | ||
179 | The user should be able to see the state of these bits from operational | |
180 | state output, i.e. `show interface` or `show foobar neighbors`. The log | |
181 | message indicating the change may have been printed weeks ago, but the | |
182 | state can always be viewed. (If some state change has an info message but | |
183 | no "show" command, maybe that command needs to be added.) | |
184 | ||
185 | Examples: | |
186 | ||
187 | * all kinds of up/down state changes | |
188 | ||
189 | * interface coming up or going down | |
190 | * addresses being added or deleted | |
191 | * peers and neighbors coming up or going down | |
192 | ||
193 | * rejection of some routes due to user-configured route maps | |
194 | * backwards compatibility handling because another system on the network | |
195 | has a different or smaller feature set | |
196 | ||
197 | .. note:: | |
198 | The previously used **notify** priority is replaced with *info* in all | |
199 | cases. We don't currently have a well-defined use case for it. | |
200 | ||
201 | ||
202 | Debug messages and asserts | |
203 | -------------------------- | |
204 | ||
205 | Everything that is only interesting on-demand, or only while developing, | |
206 | is a **debug** message. It might be interesting to the user for a | |
207 | particularly evasive issue, but in general these are details that an | |
208 | average user might not even be able to make sense of. | |
209 | ||
210 | Most (or all?) debug messages should be behind a `debug foobar` category | |
211 | switch that controls which subset of these messages is currently | |
212 | interesting and thus printed. If a debug message doesn't have such a | |
213 | guard, there should be a good explanation as to why. | |
214 | ||
215 | Conversely, debug messages are the only thing that should be guarded by | |
216 | these switches. Neither info nor warning or error messages should be | |
217 | hidden in this way. | |
218 | ||
219 | **Asserts** should only be used as pretty crashes. We are expecting that | |
220 | asserts remain enabled in production builds, but please try to not use | |
221 | asserts in a way that would cause a security problem if the assert wasn't | |
222 | there (i.e. don't use them for length checks.) | |
223 | ||
224 | The purpose of asserts is mainly to help development and bug hunting. If | |
225 | the daemon crashes, then having some more information is nice, and the | |
226 | assert can provide crucial hints that cut down on the time needed to track | |
227 | an issue. That said, if the issue can be reasonably handled and/or isn't | |
228 | going to crash the daemon, it shouldn't be an assert. | |
229 | ||
230 | For anything else where internal constraints are violated but we're not | |
231 | breaking due to it, it's an error instead (not a debug.) These require | |
232 | "user action" of notifying the developers. | |
233 | ||
234 | Examples: | |
235 | ||
236 | * mismatched :code:`prev`/:code:`next` pointers in lists | |
237 | * some field that is absolutely needed is :code:`NULL` | |
238 | * any other kind of data structure corruption that will cause the daemon | |
239 | to crash sooner or later, one way or another |