Commit | Line | Data |
---|---|---|
d2e6a577 FG |
1 | |
2 | Use of the cluster log | |
3 | ====================== | |
4 | ||
5 | (Note: none of this applies to the local "dout" logging. This is about | |
6 | the cluster log that we send through the mon daemons.) | |
7 | ||
8 | Severity | |
9 | -------- | |
10 | ||
11 | Use ERR for situations where the cluster cannot do its job for some reason. | |
12 | For example: we tried to do a write, but it returned an error, or we tried | |
13 | to read something, but it's corrupt so we can't, or we scrubbed a PG but | |
14 | the data was inconsistent so we can't recover. | |
15 | ||
16 | Use WRN for incidents that the cluster can handle, but that have some abnormal/negative | |
11fdf7f2 | 17 | aspect, such as a temporary degradation of service, or an unexpected internal |
d2e6a577 FG |
18 | value. For example, a metadata error that can be auto-fixed, or a slow operation. |
19 | ||
20 | Use INFO for ordinary cluster operations that do not indicate a fault in | |
21 | Ceph. It is especially important that INFO level messages are clearly | |
22 | worded and do not cause confusion or alarm. | |
23 | ||
24 | Frequency | |
25 | --------- | |
26 | ||
27 | It is important that messages of all severities are not excessively | |
28 | frequent. Consumers may be using a rotating log buffer that contains | |
29 | messages of all severities, so even DEBUG messages could interfere | |
30 | with proper display of the latest INFO messages if the DEBUG messages | |
31 | are too frequent. | |
32 | ||
33 | Remember that if you have a bad state (as opposed to event), that is | |
34 | what health checks are for -- do not spam the cluster log to indicate | |
35 | a continuing unhealthy state. | |
36 | ||
37 | Do not emit cluster log messages for events that scale with | |
38 | the number of clients or level of activity on the system, or for | |
39 | events that occur regularly in normal operation. For example, it | |
40 | would be inappropriate to emit an INFO message about every | |
41 | new client that connects (scales with #clients), or to emit an INFO | |
42 | message about every CephFS subtree migration (occurs regularly). | |
43 | ||
44 | Language and formatting | |
45 | ----------------------- | |
46 | ||
47 | (Note: these guidelines matter much less for DEBUG-level messages than | |
48 | for INFO and above. Concentrate your efforts on making INFO/WRN/ERR | |
49 | messages as readable as possible.) | |
50 | ||
51 | Use the passive voice. For example, use "Object xyz could not be read", rather | |
52 | than "I could not read the object xyz". | |
53 | ||
54 | Print long/big identifiers, such as inode numbers, as hex, prefixed | |
55 | with 0x so that the user can tell it is hex. We do this because | |
56 | the 0x makes it unambiguous (no equivalent for decimal), and because | |
57 | the hex form is more likely to fit on the screen. | |
58 | ||
59 | Print size quantities as human-readable MB/GB/etc., including the unit | |
60 | at the end of the number. Exception: if you are specifying an offset, | |
61 | where precision is essential to the meaning, then you can specify | |
62 | the value in bytes (but print it as hex). | |
63 | ||
64 | Make a good faith effort to fit your message on a single line. It does | |
65 | not have to be guaranteed, but it should at least usually be | |
66 | the case. That means, generally, no printing of lists unless there | |
67 | are only a few items in the list. | |
68 | ||
69 | Use nouns that are meaningful to the user, and defined in the | |
70 | documentation. Common acronyms are OK -- don't waste screen space | |
71 | typing "Rados Object Gateway" instead of RGW. Do not use internal | |
72 | class names like "MDCache" or "Objecter". It is okay to mention | |
73 | internal structures if they are the direct subject of the message, | |
f67539c2 | 74 | for example in a corruption, but use plain English. |
d2e6a577 FG |
75 | Example: instead of "Objecter requests" say "OSD client requests" |
76 | Example: it is okay to mention internal structure in the context | |
11fdf7f2 | 77 | of "Corrupt session table" (but don't say "Corrupt SessionTable") |
d2e6a577 FG |
78 | |
79 | Where possible, describe the consequence for system availability, rather | |
80 | than only describing the underlying state. For example, rather than | |
81 | saying "MDS myfs.0 is replaying", say that "myfs is degraded, waiting | |
82 | for myfs.0 to finish starting". | |
83 | ||
84 | While common acronyms are fine, don't randomly truncate words. It's not | |
85 | "dir ino", it's "directory inode". | |
86 | ||
87 | If you're logging something that "should never happen", i.e. a situation | |
88 | where it would be an assertion, but we're helpfully not crashing, then | |
89 | make that clear in the language -- this is probably not a situation | |
90 | that the user can remediate themselves. | |
91 | ||
92 | Avoid UNIX/programmer jargon. Instead of "errno", just say "error" (or | |
93 | preferably give something more descriptive than the number!) | |
94 | ||
95 | Do not mention cluster map epochs unless they are essential to | |
96 | the meaning of the message. For example, "OSDMap epoch 123 is corrupt" | |
97 | would be okay (the epoch is the point of the message), but saying "OSD | |
98 | 123 is down in OSDMap epoch 456" would not be (the osdmap and epoch | |
99 | concepts are an implementation detail; the down-ness of the OSD | |
100 | is the real message). Feel free to send additional detail to | |
101 | the daemon's local log (via `dout`/`derr`). | |
102 | ||
103 | If you log a problem that may go away in the future, make sure you | |
104 | also log when it goes away. Whatever priority you logged the original | |
105 | message at, log the "going away" message at INFO. | |
106 |
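
The formatting guidelines above (hex for long identifiers, human-readable sizes with a trailing unit, hex byte values for offsets) can be illustrated with a short sketch. This is not Ceph code: Ceph daemons are written in C++, and the helper names `format_identifier`, `format_size`, and `format_offset` are hypothetical, chosen only for this example. The use of 1024-based units and one decimal place is also an assumption; the guidelines do not specify either.

```python
def format_identifier(value: int) -> str:
    """Render a long identifier (e.g. an inode number) as 0x-prefixed hex."""
    return f"0x{value:x}"


def format_size(num_bytes: int) -> str:
    """Render a byte count as a human-readable quantity, unit at the end.

    Assumption: 1024-based steps and one decimal place for KB and above.
    """
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(size)} {unit}"
            return f"{size:.1f} {unit}"
        size /= 1024
    raise AssertionError("unreachable: the loop always returns")


def format_offset(offset: int) -> str:
    """Render an offset exactly, in bytes, printed as hex."""
    return f"0x{offset:x}"


# Hypothetical message text, written in the passive voice as recommended.
message = (
    f"Object {format_identifier(2**40)} could not be read at offset "
    f"{format_offset(4096)} ({format_size(5 * 1024**3)} object)"
)
```

Keeping the exact hex form for offsets and the rounded form for sizes follows the exception described above: precision matters for an offset, while a size only needs to be readable at a glance.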