]>
Commit | Line | Data |
---|---|---|
abd41443 TT |
1 | Event Tracing |
2 | ||
3 | Documentation written by Theodore Ts'o | |
95b69608 | 4 | Updated by Li Zefan and Tom Zanussi |
abd41443 | 5 | |
143c145e LZ |
6 | 1. Introduction |
7 | =============== | |
abd41443 TT |
8 | |
9 | Tracepoints (see Documentation/trace/tracepoints.txt) can be used | |
10 | without creating custom kernel modules to register probe functions | |
11 | using the event tracing infrastructure. | |
12 | ||
13 | Not all tracepoints can be traced using the event tracing system; | |
14 | the kernel developer must provide code snippets which define how the | |
15 | tracing information is saved into the tracing buffer, and how the | |
143c145e | 16 | tracing information should be printed. |
abd41443 | 17 | |
143c145e LZ |
18 | 2. Using Event Tracing |
19 | ====================== | |
20 | ||
21 | 2.1 Via the 'set_event' interface | |
22 | --------------------------------- | |
abd41443 TT |
23 | |
24 | The events which are available for tracing can be found in the file | |
52ad51e7 | 25 | /sys/kernel/debug/tracing/available_events. |
abd41443 TT |
26 | |
27 | To enable a particular event, such as 'sched_wakeup', simply echo it | |
52ad51e7 | 28 | to /sys/kernel/debug/tracing/set_event. For example: |
abd41443 | 29 | |
52ad51e7 | 30 | # echo sched_wakeup >> /sys/kernel/debug/tracing/set_event |
abd41443 | 31 | |
143c145e LZ |
32 | [ Note: '>>' is necessary, otherwise it will firstly disable |
33 | all the events. ] | |
abd41443 TT |
34 | |
35 | To disable an event, echo the event name to the set_event file prefixed | |
36 | with an exclamation point: | |
37 | ||
52ad51e7 | 38 | # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event |
143c145e LZ |
39 | |
40 | To disable all events, echo an empty line to the set_event file: | |
41 | ||
52ad51e7 | 42 | # echo > /sys/kernel/debug/tracing/set_event |
abd41443 | 43 | |
143c145e | 44 | To enable all events, echo '*:*' or '*:' to the set_event file: |
abd41443 | 45 | |
52ad51e7 | 46 | # echo *:* > /sys/kernel/debug/tracing/set_event |
abd41443 TT |
47 | |
48 | The events are organized into subsystems, such as ext4, irq, sched, | |
49 | etc., and a full event name looks like this: <subsystem>:<event>. The | |
50 | subsystem name is optional, but it is displayed in the available_events | |
51 | file. All of the events in a subsystem can be specified via the syntax | |
52 | "<subsystem>:*"; for example, to enable all irq events, you can use the | |
53 | command: | |
54 | ||
52ad51e7 | 55 | # echo 'irq:*' > /sys/kernel/debug/tracing/set_event |
143c145e LZ |
56 | |
57 | 2.2 Via the 'enable' toggle | |
58 | --------------------------- | |
59 | ||
52ad51e7 | 60 | The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy |
143c145e LZ |
61 | of directories. |
62 | ||
63 | To enable event 'sched_wakeup': | |
64 | ||
52ad51e7 | 65 | # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable |
143c145e LZ |
66 | |
67 | To disable it: | |
68 | ||
52ad51e7 | 69 | # echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable |
143c145e LZ |
70 | |
71 | To enable all events in sched subsystem: | |
72 | ||
52ad51e7 | 73 | # echo 1 > /sys/kernel/debug/tracing/events/sched/enable |
143c145e | 74 | |
6afb1c65 | 75 | To enable all events: |
143c145e | 76 | |
52ad51e7 | 77 | # echo 1 > /sys/kernel/debug/tracing/events/enable |
143c145e LZ |
78 | |
79 | When reading one of these enable files, there are four results: | |
80 | ||
81 | 0 - all events this file affects are disabled | |
82 | 1 - all events this file affects are enabled | |
83 | X - there is a mixture of events enabled and disabled | |
84 | ? - this file does not affect any event | |
85 | ||
020e5f85 LZ |
86 | 2.3 Boot option |
87 | --------------- | |
88 | ||
89 | In order to facilitate early boot debugging, use boot option: | |
90 | ||
91 | trace_event=[event-list] | |
92 | ||
03d646e6 LZ |
93 | event-list is a comma separated list of events. See section 2.1 for event |
94 | format. | |
020e5f85 | 95 | |
143c145e LZ |
96 | 3. Defining an event-enabled tracepoint |
97 | ======================================= | |
98 | ||
99 | See The example provided in samples/trace_events | |
100 | ||
95b69608 TZ |
101 | 4. Event formats |
102 | ================ | |
103 | ||
104 | Each trace event has a 'format' file associated with it that contains | |
105 | a description of each field in a logged event. This information can | |
106 | be used to parse the binary trace stream, and is also the place to | |
107 | find the field names that can be used in event filters (see section 5). | |
108 | ||
109 | It also displays the format string that will be used to print the | |
110 | event in text mode, along with the event name and ID used for | |
111 | profiling. | |
112 | ||
113 | Every event has a set of 'common' fields associated with it; these are | |
114 | the fields prefixed with 'common_'. The other fields vary between | |
115 | events and correspond to the fields defined in the TRACE_EVENT | |
116 | definition for that event. | |
117 | ||
118 | Each field in the format has the form: | |
119 | ||
120 | field:field-type field-name; offset:N; size:N; | |
121 | ||
122 | where offset is the offset of the field in the trace record and size | |
123 | is the size of the data item, in bytes. | |
124 | ||
125 | For example, here's the information displayed for the 'sched_wakeup' | |
126 | event: | |
127 | ||
bfc672dc | 128 | # cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format |
95b69608 TZ |
129 | |
130 | name: sched_wakeup | |
131 | ID: 60 | |
132 | format: | |
133 | field:unsigned short common_type; offset:0; size:2; | |
134 | field:unsigned char common_flags; offset:2; size:1; | |
135 | field:unsigned char common_preempt_count; offset:3; size:1; | |
136 | field:int common_pid; offset:4; size:4; | |
137 | field:int common_tgid; offset:8; size:4; | |
138 | ||
139 | field:char comm[TASK_COMM_LEN]; offset:12; size:16; | |
140 | field:pid_t pid; offset:28; size:4; | |
141 | field:int prio; offset:32; size:4; | |
142 | field:int success; offset:36; size:4; | |
143 | field:int cpu; offset:40; size:4; | |
144 | ||
145 | print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid, | |
146 | REC->prio, REC->success, REC->cpu | |
147 | ||
148 | This event contains 10 fields, the first 5 common and the remaining 5 | |
149 | event-specific. All the fields for this event are numeric, except for | |
150 | 'comm' which is a string, a distinction important for event filtering. | |
151 | ||
152 | 5. Event filtering | |
153 | ================== | |
154 | ||
155 | Trace events can be filtered in the kernel by associating boolean | |
156 | 'filter expressions' with them. As soon as an event is logged into | |
157 | the trace buffer, its fields are checked against the filter expression | |
158 | associated with that event type. An event with field values that | |
159 | 'match' the filter will appear in the trace output, and an event whose | |
160 | values don't match will be discarded. An event with no filter | |
161 | associated with it matches everything, and is the default when no | |
162 | filter has been set for an event. | |
163 | ||
164 | 5.1 Expression syntax | |
165 | --------------------- | |
166 | ||
167 | A filter expression consists of one or more 'predicates' that can be | |
168 | combined using the logical operators '&&' and '||'. A predicate is | |
169 | simply a clause that compares the value of a field contained within a | |
170 | logged event with a constant value and returns either 0 or 1 depending | |
171 | on whether the field value matched (1) or didn't match (0): | |
172 | ||
173 | field-name relational-operator value | |
174 | ||
175 | Parentheses can be used to provide arbitrary logical groupings and | |
176 | double-quotes can be used to prevent the shell from interpreting | |
177 | operators as shell metacharacters. | |
178 | ||
179 | The field-names available for use in filters can be found in the | |
180 | 'format' files for trace events (see section 4). | |
181 | ||
182 | The relational-operators depend on the type of the field being tested: | |
183 | ||
184 | The operators available for numeric fields are: | |
185 | ||
1a891cf1 | 186 | ==, !=, <, <=, >, >=, & |
95b69608 TZ |
187 | |
188 | And for string fields they are: | |
189 | ||
c3e13c7c | 190 | ==, !=, ~ |
95b69608 | 191 | |
c3e13c7c SRRH |
192 | The glob (~) only accepts a wild card character (*) at the start and or |
193 | end of the string. For example: | |
194 | ||
195 | prev_comm ~ "*sh" | |
196 | prev_comm ~ "sh*" | |
197 | prev_comm ~ "*sh*" | |
198 | ||
199 | But does not allow for it to be within the string: | |
200 | ||
201 | prev_comm ~ "ba*sh" <-- is invalid | |
95b69608 | 202 | |
95b69608 TZ |
203 | 5.2 Setting filters |
204 | ------------------- | |
205 | ||
206 | A filter for an individual event is set by writing a filter expression | |
207 | to the 'filter' file for the given event. | |
208 | ||
209 | For example: | |
210 | ||
bfc672dc | 211 | # cd /sys/kernel/debug/tracing/events/sched/sched_wakeup |
95b69608 TZ |
212 | # echo "common_preempt_count > 4" > filter |
213 | ||
214 | A slightly more involved example: | |
215 | ||
bfc672dc | 216 | # cd /sys/kernel/debug/tracing/events/signal/signal_generate |
95b69608 TZ |
217 | # echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter |
218 | ||
219 | If there is an error in the expression, you'll get an 'Invalid | |
220 | argument' error when setting it, and the erroneous string along with | |
221 | an error message can be seen by looking at the filter e.g.: | |
222 | ||
bfc672dc | 223 | # cd /sys/kernel/debug/tracing/events/signal/signal_generate |
95b69608 TZ |
224 | # echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter |
225 | -bash: echo: write error: Invalid argument | |
226 | # cat filter | |
227 | ((sig >= 10 && sig < 15) || dsig == 17) && comm != bash | |
228 | ^ | |
229 | parse_error: Field not found | |
230 | ||
231 | Currently the caret ('^') for an error always appears at the beginning of | |
232 | the filter string; the error message should still be useful though | |
233 | even without more accurate position info. | |
234 | ||
235 | 5.3 Clearing filters | |
236 | -------------------- | |
237 | ||
238 | To clear the filter for an event, write a '0' to the event's filter | |
239 | file. | |
240 | ||
241 | To clear the filters for all events in a subsystem, write a '0' to the | |
242 | subsystem's filter file. | |
243 | ||
244 | 5.3 Subsystem filters | |
245 | --------------------- | |
246 | ||
247 | For convenience, filters for every event in a subsystem can be set or | |
248 | cleared as a group by writing a filter expression into the filter file | |
88393161 | 249 | at the root of the subsystem. Note however, that if a filter for any |
95b69608 TZ |
250 | event within the subsystem lacks a field specified in the subsystem |
251 | filter, or if the filter can't be applied for any other reason, the | |
252 | filter for that event will retain its previous setting. This can | |
253 | result in an unintended mixture of filters which could lead to | |
254 | confusing (to the user who might think different filters are in | |
255 | effect) trace output. Only filters that reference just the common | |
256 | fields can be guaranteed to propagate successfully to all events. | |
257 | ||
258 | Here are a few subsystem filter examples that also illustrate the | |
259 | above points: | |
260 | ||
88393161 | 261 | Clear the filters on all events in the sched subsystem: |
95b69608 TZ |
262 | |
263 | # cd /sys/kernel/debug/tracing/events/sched | |
264 | # echo 0 > filter | |
265 | # cat sched_switch/filter | |
266 | none | |
267 | # cat sched_wakeup/filter | |
268 | none | |
269 | ||
270 | Set a filter using only common fields for all events in the sched | |
88393161 | 271 | subsystem (all events end up with the same filter): |
95b69608 TZ |
272 | |
273 | # cd /sys/kernel/debug/tracing/events/sched | |
274 | # echo common_pid == 0 > filter | |
275 | # cat sched_switch/filter | |
276 | common_pid == 0 | |
277 | # cat sched_wakeup/filter | |
278 | common_pid == 0 | |
279 | ||
280 | Attempt to set a filter using a non-common field for all events in the | |
88393161 | 281 | sched subsystem (all events but those that have a prev_pid field retain |
95b69608 TZ |
282 | their old filters): |
283 | ||
284 | # cd /sys/kernel/debug/tracing/events/sched | |
285 | # echo prev_pid == 0 > filter | |
286 | # cat sched_switch/filter | |
287 | prev_pid == 0 | |
288 | # cat sched_wakeup/filter | |
289 | common_pid == 0 | |
ac38fb85 TZ |
290 | |
291 | 6. Event triggers | |
292 | ================= | |
293 | ||
294 | Trace events can be made to conditionally invoke trigger 'commands' | |
295 | which can take various forms and are described in detail below; | |
296 | examples would be enabling or disabling other trace events or invoking | |
297 | a stack trace whenever the trace event is hit. Whenever a trace event | |
298 | with attached triggers is invoked, the set of trigger commands | |
299 | associated with that event is invoked. Any given trigger can | |
300 | additionally have an event filter of the same form as described in | |
301 | section 5 (Event filtering) associated with it - the command will only | |
302 | be invoked if the event being invoked passes the associated filter. | |
303 | If no filter is associated with the trigger, it always passes. | |
304 | ||
305 | Triggers are added to and removed from a particular event by writing | |
306 | trigger expressions to the 'trigger' file for the given event. | |
307 | ||
308 | A given event can have any number of triggers associated with it, | |
309 | subject to any restrictions that individual commands may have in that | |
310 | regard. | |
311 | ||
312 | Event triggers are implemented on top of "soft" mode, which means that | |
313 | whenever a trace event has one or more triggers associated with it, | |
314 | the event is activated even if it isn't actually enabled, but is | |
315 | disabled in a "soft" mode. That is, the tracepoint will be called, | |
316 | but just will not be traced, unless of course it's actually enabled. | |
317 | This scheme allows triggers to be invoked even for events that aren't | |
318 | enabled, and also allows the current event filter implementation to be | |
319 | used for conditionally invoking triggers. | |
320 | ||
321 | The syntax for event triggers is roughly based on the syntax for | |
322 | set_ftrace_filter 'ftrace filter commands' (see the 'Filter commands' | |
323 | section of Documentation/trace/ftrace.txt), but there are major | |
324 | differences and the implementation isn't currently tied to it in any | |
325 | way, so beware about making generalizations between the two. | |
326 | ||
327 | 6.1 Expression syntax | |
328 | --------------------- | |
329 | ||
330 | Triggers are added by echoing the command to the 'trigger' file: | |
331 | ||
332 | # echo 'command[:count] [if filter]' > trigger | |
333 | ||
334 | Triggers are removed by echoing the same command but starting with '!' | |
335 | to the 'trigger' file: | |
336 | ||
337 | # echo '!command[:count] [if filter]' > trigger | |
338 | ||
339 | The [if filter] part isn't used in matching commands when removing, so | |
340 | leaving that off in a '!' command will accomplish the same thing as | |
341 | having it in. | |
342 | ||
343 | The filter syntax is the same as that described in the 'Event | |
344 | filtering' section above. | |
345 | ||
346 | For ease of use, writing to the trigger file using '>' currently just | |
347 | adds or removes a single trigger and there's no explicit '>>' support | |
348 | ('>' actually behaves like '>>') or truncation support to remove all | |
349 | triggers (you have to use '!' for each one added.) | |
350 | ||
351 | 6.2 Supported trigger commands | |
352 | ------------------------------ | |
353 | ||
354 | The following commands are supported: | |
355 | ||
356 | - enable_event/disable_event | |
357 | ||
358 | These commands can enable or disable another trace event whenever | |
359 | the triggering event is hit. When these commands are registered, | |
360 | the other trace event is activated, but disabled in a "soft" mode. | |
361 | That is, the tracepoint will be called, but just will not be traced. | |
362 | The event tracepoint stays in this mode as long as there's a trigger | |
363 | in effect that can trigger it. | |
364 | ||
365 | For example, the following trigger causes kmalloc events to be | |
366 | traced when a read system call is entered, and the :1 at the end | |
367 | specifies that this enablement happens only once: | |
368 | ||
369 | # echo 'enable_event:kmem:kmalloc:1' > \ | |
370 | /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger | |
371 | ||
372 | The following trigger causes kmalloc events to stop being traced | |
373 | when a read system call exits. This disablement happens on every | |
374 | read system call exit: | |
375 | ||
376 | # echo 'disable_event:kmem:kmalloc' > \ | |
377 | /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger | |
378 | ||
379 | The format is: | |
380 | ||
381 | enable_event:<system>:<event>[:count] | |
382 | disable_event:<system>:<event>[:count] | |
383 | ||
384 | To remove the above commands: | |
385 | ||
386 | # echo '!enable_event:kmem:kmalloc:1' > \ | |
387 | /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger | |
388 | ||
389 | # echo '!disable_event:kmem:kmalloc' > \ | |
390 | /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger | |
391 | ||
392 | Note that there can be any number of enable/disable_event triggers | |
393 | per triggering event, but there can only be one trigger per | |
394 | triggered event. e.g. sys_enter_read can have triggers enabling both | |
395 | kmem:kmalloc and sched:sched_switch, but can't have two kmem:kmalloc | |
396 | versions such as kmem:kmalloc and kmem:kmalloc:1 or 'kmem:kmalloc if | |
397 | bytes_req == 256' and 'kmem:kmalloc if bytes_alloc == 256' (they | |
398 | could be combined into a single filter on kmem:kmalloc though). | |
399 | ||
400 | - stacktrace | |
401 | ||
402 | This command dumps a stacktrace in the trace buffer whenever the | |
403 | triggering event occurs. | |
404 | ||
405 | For example, the following trigger dumps a stacktrace every time the | |
406 | kmalloc tracepoint is hit: | |
407 | ||
408 | # echo 'stacktrace' > \ | |
409 | /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger | |
410 | ||
411 | The following trigger dumps a stacktrace the first 5 times a kmalloc | |
412 | request happens with a size >= 64K | |
413 | ||
414 | # echo 'stacktrace:5 if bytes_req >= 65536' > \ | |
415 | /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger | |
416 | ||
417 | The format is: | |
418 | ||
419 | stacktrace[:count] | |
420 | ||
421 | To remove the above commands: | |
422 | ||
423 | # echo '!stacktrace' > \ | |
424 | /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger | |
425 | ||
426 | # echo '!stacktrace:5 if bytes_req >= 65536' > \ | |
427 | /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger | |
428 | ||
429 | The latter can also be removed more simply by the following (without | |
430 | the filter): | |
431 | ||
432 | # echo '!stacktrace:5' > \ | |
433 | /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger | |
434 | ||
435 | Note that there can be only one stacktrace trigger per triggering | |
436 | event. | |
437 | ||
438 | - snapshot | |
439 | ||
440 | This command causes a snapshot to be triggered whenever the | |
441 | triggering event occurs. | |
442 | ||
443 | The following command creates a snapshot every time a block request | |
444 | queue is unplugged with a depth > 1. If you were tracing a set of | |
445 | events or functions at the time, the snapshot trace buffer would | |
c98be0c9 | 446 | capture those events when the trigger event occurred: |
ac38fb85 TZ |
447 | |
448 | # echo 'snapshot if nr_rq > 1' > \ | |
449 | /sys/kernel/debug/tracing/events/block/block_unplug/trigger | |
450 | ||
451 | To only snapshot once: | |
452 | ||
453 | # echo 'snapshot:1 if nr_rq > 1' > \ | |
454 | /sys/kernel/debug/tracing/events/block/block_unplug/trigger | |
455 | ||
456 | To remove the above commands: | |
457 | ||
458 | # echo '!snapshot if nr_rq > 1' > \ | |
459 | /sys/kernel/debug/tracing/events/block/block_unplug/trigger | |
460 | ||
461 | # echo '!snapshot:1 if nr_rq > 1' > \ | |
462 | /sys/kernel/debug/tracing/events/block/block_unplug/trigger | |
463 | ||
464 | Note that there can be only one snapshot trigger per triggering | |
465 | event. | |
466 | ||
467 | - traceon/traceoff | |
468 | ||
469 | These commands turn tracing on and off when the specified events are | |
470 | hit. The parameter determines how many times the tracing system is | |
471 | turned on and off. If unspecified, there is no limit. | |
472 | ||
473 | The following command turns tracing off the first time a block | |
474 | request queue is unplugged with a depth > 1. If you were tracing a | |
475 | set of events or functions at the time, you could then examine the | |
476 | trace buffer to see the sequence of events that led up to the | |
477 | trigger event: | |
478 | ||
479 | # echo 'traceoff:1 if nr_rq > 1' > \ | |
480 | /sys/kernel/debug/tracing/events/block/block_unplug/trigger | |
481 | ||
482 | To always disable tracing when nr_rq > 1 : | |
483 | ||
484 | # echo 'traceoff if nr_rq > 1' > \ | |
485 | /sys/kernel/debug/tracing/events/block/block_unplug/trigger | |
486 | ||
487 | To remove the above commands: | |
488 | ||
489 | # echo '!traceoff:1 if nr_rq > 1' > \ | |
490 | /sys/kernel/debug/tracing/events/block/block_unplug/trigger | |
491 | ||
492 | # echo '!traceoff if nr_rq > 1' > \ | |
493 | /sys/kernel/debug/tracing/events/block/block_unplug/trigger | |
494 | ||
495 | Note that there can be only one traceon or traceoff trigger per | |
496 | triggering event. |