Use of the cluster log
======================

(Note: none of this applies to the local "dout" logging.  This is about
the cluster log that we send through the mon daemons)

Severity
--------

Use ERR for situations where the cluster cannot do its job for some reason.
For example: we tried to do a write, but it returned an error, or we tried
to read something, but it's corrupt so we can't, or we scrubbed a PG but
the data was inconsistent so we can't recover.

Use WRN for incidents that the cluster can handle, but have some abnormal/negative
aspect, such as a temporary degradation of service, or an unexpected internal
value.  For example, a metadata error that can be auto-fixed, or a slow operation.

Use INFO for ordinary cluster operations that do not indicate a fault in
Ceph.  It is especially important that INFO level messages are clearly
worded and do not cause confusion or alarm.

Frequency
---------

It is important that messages of all severities are not excessively
frequent.  Consumers may be using a rotating log buffer that contains
messages of all severities, so even DEBUG messages could interfere
with proper display of the latest INFO messages if the DEBUG messages
are too frequent.

Remember that if you have a bad state (as opposed to event), that is
what health checks are for -- do not spam the cluster log to indicate
a continuing unhealthy state.

Do not emit cluster log messages for events that scale with
the number of clients or level of activity on the system, or for
events that occur regularly in normal operation.  For example, it
would be inappropriate to emit an INFO message about every
new client that connects (scales with #clients), or to emit an INFO
message about every CephFS subtree migration (occurs regularly).

Language and formatting
-----------------------

(Note: these guidelines matter much less for DEBUG-level messages than
for INFO and above.  Concentrate your efforts on making INFO/WRN/ERR
messages as readable as possible.)

Use the passive voice.  For example, use "Object xyz could not be read", rather
than "I could not read the object xyz".

Print long/big identifiers, such as inode numbers, as hex, prefixed
with 0x so that the user can tell it is hex.  We do this because
the 0x makes it unambiguous (decimal has no equivalent prefix), and because
the hex form is more likely to fit on the screen.
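
As an illustration of the identifier rule, here is a minimal Python sketch.
The helper name ``format_id`` and the inode value are hypothetical and are
not part of Ceph; the point is only the ``0x``-prefixed hex rendering.

```python
def format_id(value: int) -> str:
    """Render a large numeric identifier as lowercase hex with a 0x prefix."""
    if value < 0:
        raise ValueError("identifiers are expected to be non-negative")
    # The '#x' format spec adds the 0x prefix automatically.
    return f"{value:#x}"


inode = 1099511627777  # hypothetical inode number, used only for illustration
message = f"Directory inode {format_id(inode)} could not be read"
print(message)
```

The same formatting is available in most languages, so the helper is
mainly a way to keep the prefix consistent across messages.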

Print size quantities in human-readable form (MB, GB, etc.), including the unit
at the end of the number.  Exception: if you are specifying an offset,
where precision is essential to the meaning, then you can specify
the value in bytes (but print it as hex).
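
The size and offset rules could be sketched as follows.  This is a
hypothetical helper, not a Ceph function: the names ``format_size`` and
``format_offset``, the choice of 1024 as the step between units, and the
one-decimal rounding are all assumptions made for the example.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def format_size(num_bytes: int) -> str:
    """Render a byte count with a trailing unit, e.g. '1.5 KB'.

    Assumption: each unit is 1024 times the previous one.
    """
    if num_bytes < 0:
        raise ValueError("sizes are expected to be non-negative")
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value is small enough,
        # or at the largest unit we know about.
        if value < 1024 or unit == UNITS[-1]:
            break
        value /= 1024
    if unit == "B":
        return f"{num_bytes} B"
    return f"{value:.1f} {unit}"


def format_offset(offset: int) -> str:
    """Offsets keep full precision: exact bytes, printed as hex."""
    return f"{offset:#x}"


print(f"Recovered {format_size(5 * 1024**3)} at offset {format_offset(4096)}")
```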

Make a good faith effort to fit your message on a single line.  It does
not have to be guaranteed, but it should at least usually be
the case.  That means, generally, no printing of lists unless there
are only a few items in the list.
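
One way to keep a message on a single line when a list might be long is
to print only the first few items and a count of the rest.  The sketch
below assumes a hypothetical helper ``summarize_items`` and an arbitrary
limit of three items; neither comes from Ceph.

```python
def summarize_items(items, max_shown=3):
    """Join up to max_shown items; summarize the remainder as a count."""
    items = list(items)
    shown = ", ".join(str(item) for item in items[:max_shown])
    hidden = len(items) - max_shown
    if hidden > 0:
        return f"{shown} and {hidden} more"
    return shown


few = ["osd.1", "osd.2"]
many = [f"osd.{i}" for i in range(10)]
print(summarize_items(few))
print(summarize_items(many))
```

The full list can still go to the daemon's local log, where line length
matters less.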

Use nouns that are meaningful to the user, and defined in the
documentation.  Common acronyms are OK -- don't waste screen space
typing "Rados Object Gateway" instead of RGW.  Do not use internal
class names like "MDCache" or "Objecter".  It is okay to mention
internal structures if they are the direct subject of the message,
for example in a corruption, but use plain English.

Example: instead of "Objecter requests", say "OSD client requests".

Example: it is okay to mention an internal structure in the context
of "Corrupt session table" (but don't say "Corrupt SessionTable").

Where possible, describe the consequence for system availability, rather
than only describing the underlying state.  For example, rather than
saying "MDS myfs.0 is replaying", say that "myfs is degraded, waiting
for myfs.0 to finish starting".

While common acronyms are fine, don't randomly truncate words.  It's not
"dir ino", it's "directory inode".

If you're logging something that "should never happen", i.e. a situation
where it would be an assertion, but we're helpfully not crashing, then
make that clear in the language -- this is probably not a situation
that the user can remediate themselves.

Avoid UNIX/programmer jargon.  Instead of "errno", just say "error" (or
preferably give something more descriptive than the number!)
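
A minimal sketch of turning a numeric error code into readable text,
using Python's standard ``os.strerror``.  The helper name
``describe_error`` is hypothetical, and accepting negative codes is an
assumption (many C and C++ code paths return errors as negative values).
The exact wording of the description depends on the platform.

```python
import errno
import os


def describe_error(code: int) -> str:
    """Return a readable description of an error code, keeping the number."""
    code = abs(code)  # assumption: callers may pass -ENOENT style values
    return f"{os.strerror(code)} (error {code})"


print(f"Object xyz could not be read: {describe_error(-errno.ENOENT)}")
```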

Do not mention cluster map epochs unless they are essential to
the meaning of the message.  For example, "OSDMap epoch 123 is corrupt"
would be okay (the epoch is the point of the message), but saying "OSD
123 is down in OSDMap epoch 456" would not be (the osdmap and epoch
concepts are an implementation detail, the down-ness of the OSD
is the real message).  Feel free to send additional detail to
the daemon's local log (via ``dout``/``derr``).

If you log a problem that may go away in the future, make sure you
also log when it goes away.  Whatever priority you logged the original
message at, log the "going away" message at INFO.
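
The "log when it appears, log when it goes away" pattern can be sketched
as a small state tracker that emits only on transitions, which also
avoids repeating a message while a bad state persists.  Everything here
is hypothetical: the class ``ProblemTracker``, the ``emit`` callback, and
the wording of the "cleared" message are assumptions for illustration,
not Ceph interfaces.

```python
class ProblemTracker:
    """Emit one message when a problem starts and one when it clears."""

    def __init__(self, emit, severity="WRN"):
        self.emit = emit          # callback taking (severity, message)
        self.severity = severity  # severity used when the problem appears
        self.active = False

    def update(self, is_bad, raised_msg, cleared_msg):
        if is_bad and not self.active:
            self.emit(self.severity, raised_msg)
        elif not is_bad and self.active:
            # The "going away" message is always logged at INFO.
            self.emit("INFO", cleared_msg)
        self.active = is_bad


log = []
tracker = ProblemTracker(lambda sev, msg: log.append((sev, msg)))
for is_bad in [False, True, True, True, False, False]:
    tracker.update(
        is_bad,
        "myfs is degraded, waiting for myfs.0 to finish starting",
        "myfs is no longer degraded",
    )

for severity, text in log:
    print(severity, text)
```

Six state observations produce two log entries here: one at WRN when the
problem starts and one at INFO when it clears.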