Use of the cluster log
======================

(Note: none of this applies to the local "dout" logging. This is about
the cluster log that we send through the mon daemons.)

Severity
--------

Use ERR for situations where the cluster cannot do its job for some reason.
For example: we tried to do a write, but it returned an error, or we tried
to read something, but it's corrupt so we can't, or we scrubbed a PG but
the data was inconsistent so we can't recover.

Use WRN for incidents that the cluster can handle, but that have some
abnormal/negative aspect, such as a temporary degradation of service, or an
unexpected internal value. For example, a metadata error that can be
auto-fixed, or a slow operation.

Use INFO for ordinary cluster operations that do not indicate a fault in
Ceph. It is especially important that INFO level messages are clearly
worded and do not cause confusion or alarm.

Frequency
---------

Messages of any severity must not be excessively frequent. Consumers
may be using a rotating log buffer that contains messages of all
severities, so even DEBUG messages could interfere with proper display
of the latest INFO messages if the DEBUG messages are too frequent.
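The crowding effect can be illustrated with a small sketch. It assumes a hypothetical consumer that keeps only the five most recent entries in a fixed-size buffer. The buffer size and the message texts are made up for illustration and are not taken from any real Ceph tool.

```python
from collections import deque

# Hypothetical consumer: keeps only the 5 most recent cluster log
# entries, regardless of severity (the size is an assumption).
recent = deque(maxlen=5)

# Two INFO messages arrive first (illustrative text only).
recent.append(("INFO", "osd.3 marked in"))
recent.append(("INFO", "myfs is no longer degraded"))

# A burst of overly frequent DEBUG messages follows.
for i in range(5):
    recent.append(("DEBUG", f"debug event {i}"))

# The INFO entries have been evicted by the DEBUG burst.
info_left = [msg for sev, msg in recent if sev == "INFO"]
print(info_left)
```

Once the burst is at least as long as the buffer, none of the earlier INFO messages remain visible to the consumer.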

Remember that if you have a bad state (as opposed to an event), that is
what health checks are for -- do not spam the cluster log to indicate
a continuing unhealthy state.

Do not emit cluster log messages for events that scale with
the number of clients or level of activity on the system, or for
events that occur regularly in normal operation. For example, it
would be inappropriate to emit an INFO message about every
new client that connects (scales with #clients), or to emit an INFO
message about every CephFS subtree migration (occurs regularly).

Language and formatting
-----------------------

(Note: these guidelines matter much less for DEBUG-level messages than
for INFO and above. Concentrate your efforts on making INFO/WRN/ERR
messages as readable as possible.)

Use the passive voice. For example, use "Object xyz could not be read", rather
than "I could not read the object xyz".

Print long/big identifiers, such as inode numbers, as hex, prefixed
with 0x so that the user can tell it is hex. We do this because
the 0x makes it unambiguous (no equivalent for decimal), and because
the hex form is more likely to fit on the screen.
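As a minimal sketch of the identifier rule, the helper below formats an integer as 0x-prefixed hex. The name ``format_inode`` and the sample inode value are hypothetical and only serve the illustration.

```python
def format_inode(ino: int) -> str:
    """Render a large identifier as lowercase hex with a 0x prefix."""
    return f"0x{ino:x}"


# Sample value chosen for illustration only.
ino = 2**40 + 1000
message = f"Inode {format_inode(ino)} could not be read"
print(message)
```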

Print size quantities in human-readable form (MB/GB/etc.), including the unit
at the end of the number. Exception: if you are specifying an offset,
where precision is essential to the meaning, then you can specify
the value in bytes (but print it as hex).
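The two cases can be sketched as a pair of helpers. This is an illustration under stated assumptions: the names ``format_size`` and ``format_offset`` are hypothetical, and the choice of 1024-based steps with one decimal place is an assumption rather than something this guideline prescribes.

```python
def format_size(num_bytes: int) -> str:
    """Render a size with a trailing unit (assumes 1024-based steps)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(value)} B"
            return f"{value:.1f} {unit}"
        value /= 1024


def format_offset(offset: int) -> str:
    """Offsets need exact precision: print the byte value as hex."""
    return f"0x{offset:x}"


print(format_size(4 * 1024**3))
print(format_offset(4194304))
```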

Make a good faith effort to fit your message on a single line. It does
not have to be guaranteed, but it should at least usually be
the case. That means, generally, no printing of lists unless there
are only a few items in the list.
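One possible way to keep a message on one line is to print only the first few items and summarize the rest as a count. The sketch below assumes this approach. The helper name ``summarize``, the default limit of three and the "and N more" wording are all hypothetical choices, not an established convention.

```python
def summarize(items, limit=3):
    """Join at most `limit` items; replace the remainder with a count."""
    items = list(items)
    shown = ", ".join(str(item) for item in items[:limit])
    if len(items) > limit:
        return f"{shown} and {len(items) - limit} more"
    return shown


# Short lists are printed in full, long ones are truncated.
print(summarize(["osd.1", "osd.2"]))
print(summarize([f"osd.{i}" for i in range(10)]))
```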

Use nouns that are meaningful to the user, and defined in the
documentation. Common acronyms are OK -- don't waste screen space
typing "Rados Object Gateway" instead of RGW. Do not use internal
class names like "MDCache" or "Objecter". It is okay to mention
internal structures if they are the direct subject of the message,
for example in a corruption, but use plain English.
Example: instead of "Objecter requests" say "OSD client requests".
Example: it is okay to mention internal structure in the context
of "Corrupt session table" (but don't say "Corrupt SessionTable").

Where possible, describe the consequence for system availability, rather
than only describing the underlying state. For example, rather than
saying "MDS myfs.0 is replaying", say that "myfs is degraded, waiting
for myfs.0 to finish starting".

While common acronyms are fine, don't randomly truncate words. It's not
"dir ino", it's "directory inode".

If you're logging something that "should never happen", i.e. a situation
where it would be an assertion, but we're helpfully not crashing, then
make that clear in the language -- this is probably not a situation
that the user can remediate themselves.

Avoid UNIX/programmer jargon. Instead of "errno", just say "error" (or
preferably give something more descriptive than the number!)

Do not mention cluster map epochs unless they are essential to
the meaning of the message. For example, "OSDMap epoch 123 is corrupt"
would be okay (the epoch is the point of the message), but saying "OSD
123 is down in OSDMap epoch 456" would not be (the osdmap and epoch
concepts are an implementation detail, the down-ness of the OSD
is the real message). Feel free to send additional detail to
the daemon's local log (via `dout`/`derr`).

If you log a problem that may go away in the future, make sure you
also log when it goes away. Whatever priority you logged the original
message at, log the "going away" message at INFO.
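The pairing of a problem message with its "going away" message can be sketched as follows. ``ClusterLogStub`` and ``SlowOpsTracker`` are hypothetical stand-ins written for this illustration. They are not Ceph classes, and the message texts are invented.

```python
class ClusterLogStub:
    """Stand-in for a cluster log client; it just records entries."""

    def __init__(self):
        self.entries = []

    def log(self, severity, message):
        self.entries.append((severity, message))


class SlowOpsTracker:
    """Logs once when a problem appears and once when it clears."""

    def __init__(self, clog):
        self.clog = clog
        self.active = False

    def update(self, slow_ops):
        if slow_ops > 0 and not self.active:
            # Problem appears: log it at the severity it deserves.
            self.active = True
            self.clog.log("WRN", f"{slow_ops} slow operations detected")
        elif slow_ops == 0 and self.active:
            # Problem went away: always report that at INFO.
            self.active = False
            self.clog.log("INFO", "Slow operations have cleared")


clog = ClusterLogStub()
tracker = SlowOpsTracker(clog)
for count in (3, 5, 0, 0):
    tracker.update(count)
print(clog.entries)
```

Logging only on the transitions also avoids repeating a message while the condition persists.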