Commit | Line | Data |
---|---|---|
e7bc62b6 IM |
1 | |
2 | Performance Counters for Linux | |
3 | ------------------------------ | |
4 | ||
5 | Performance counters are special hardware registers available on most modern | |
6 | CPUs. These registers count the number of certain types of hw events, such | |
7 | as instructions executed, cache misses suffered, or branches mis-predicted - | |
8 | without slowing down the kernel or applications. These registers can also | |
9 | trigger interrupts when a threshold number of events have passed - and can | |
10 | thus be used to profile the code that runs on that CPU. | |
11 | ||
12 | The Linux Performance Counter subsystem provides an abstraction of these | |
13 | hardware capabilities. It provides per task and per CPU counters, and | |
14 | it provides event capabilities on top of those. | |
15 | ||
16 | Performance counters are accessed via special file descriptors. | |
17 | There's one file descriptor per virtual counter used. | |
18 | ||
19 | The special file descriptor is opened via the perf_counter_open() | |
20 | system call: | |
21 | ||
22 | int | |
23 | perf_counter_open(u32 hw_event_type, | |
24 | u32 hw_event_period, | |
25 | u32 record_type, | |
26 | pid_t pid, | |
27 | int cpu); | |
28 | ||
29 | The syscall returns the new fd. The fd can be used via the normal | |
30 | VFS system calls: read() can be used to read the counter, fcntl() | |
31 | can be used to set the blocking mode, etc. | |
32 | ||
33 | Multiple counters can be kept open at a time, and the counters | |
34 | can be poll()ed. | |
35 | ||
36 | When creating a new counter fd, 'hw_event_type' is one of: | |
37 | ||
38 | enum hw_event_types { | |
39 | PERF_COUNT_CYCLES, | |
40 | PERF_COUNT_INSTRUCTIONS, | |
41 | PERF_COUNT_CACHE_REFERENCES, | |
42 | PERF_COUNT_CACHE_MISSES, | |
43 | PERF_COUNT_BRANCH_INSTRUCTIONS, | |
44 | PERF_COUNT_BRANCH_MISSES, | |
45 | }; | |
46 | ||
47 | These are standardized types of events that work uniformly on all CPUs | |
48 | that implement Performance Counters support under Linux. If a CPU is | |
49 | not able to count branch-misses, then the system call will return | |
50 | -EINVAL. | |
51 | ||
52 | [ Note: more hw_event_types are supported as well, but they are CPU | |
53 | specific and are enumerated via /sys on a per CPU basis. Raw hw event | |
54 | types can be passed in as negative numbers. For example, to count | |
55 | "External bus cycles while bus lock signal asserted" events on Intel | |
56 | Core CPUs, pass in a -0x4064 event type value. ] | |
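The sign convention in the note above can be sketched in a few lines of C. This is only an illustration: the helper names raw_event_type(), is_raw_event_type() and as_syscall_arg() are hypothetical and not part of the kernel interface, and the sketch assumes the event type is interpreted as a signed 32-bit value even though the parameter is declared as u32.

```c
#include <stdint.h>

/*
 * Hypothetical helpers illustrating the "raw events are negative" rule.
 * None of these names exist in the kernel; they are for illustration only.
 */

/*
 * Turn a CPU-specific raw event code into the negative event type value.
 * Assumption: raw_code fits in 31 bits, so the negation cannot overflow.
 */
static int32_t raw_event_type(uint32_t raw_code)
{
	return -(int32_t)raw_code;
}

/* Generic events are >= 0 (the enum values); raw events are < 0. */
static int is_raw_event_type(int32_t hw_event_type)
{
	return hw_event_type < 0;
}

/*
 * Bit pattern that a u32 'hw_event_type' parameter would carry for a
 * given signed value (conversion is modulo 2^32).
 */
static uint32_t as_syscall_arg(int32_t hw_event_type)
{
	return (uint32_t)hw_event_type;
}
```

With these helpers, the Intel Core example from the note corresponds to raw_event_type(0x4064), which yields -0x4064.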
57 | ||
58 | The parameter 'hw_event_period' is the number of events before waking up | |
59 | a read() that is blocked on a counter fd. A value of zero means a | |
60 | non-blocking counter. | |
61 | ||
62 | 'record_type' is the type of data that a read() will provide for the | |
63 | counter, and it can be one of: | |
64 | ||
65 | enum perf_record_type { | |
66 | PERF_RECORD_SIMPLE, | |
67 | PERF_RECORD_IRQ, | |
68 | }; | |
69 | ||
70 | A "simple" counter is one that counts hardware events and allows | |
71 | them to be read out into a u64 count value. (read() returns 8 on | |
72 | a successful read of a simple counter.) | |
73 | ||
74 | An "irq" counter is one that will also provide IRQ context information: | |
75 | the IP of the interrupted context. In this case read() will return | |
76 | the 8-byte counter value, plus the Instruction Pointer address of the | |
77 | interrupted context. | |
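As a rough sketch of how a reader of an "irq" counter might unpack the data described above: the text only states that read() returns the 8-byte counter value plus the Instruction Pointer, so the layout used here (count first, then an 8-byte IP, both in native byte order) is an assumption, and struct irq_record and decode_irq_record() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of one "irq" record. */
struct irq_record {
	uint64_t count;	/* the 8-byte counter value */
	uint64_t ip;	/* IP of the interrupted context (assumed 8 bytes) */
};

/*
 * Decode one record from a buffer filled by read().
 * Returns 0 on success, -1 if the buffer is too short.
 * Assumed layout: count first, then IP, both in native byte order.
 */
static int decode_irq_record(const void *buf, size_t len,
			     struct irq_record *out)
{
	const unsigned char *p = buf;

	if (len < 2 * sizeof(uint64_t))
		return -1;

	/* memcpy avoids alignment problems with arbitrary buffers. */
	memcpy(&out->count, p, sizeof(uint64_t));
	memcpy(&out->ip, p + sizeof(uint64_t), sizeof(uint64_t));
	return 0;
}
```

A "simple" counter would need only the first 8 bytes, matching the read() return value of 8 mentioned earlier.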
78 | ||
79 | The 'pid' parameter allows the counter to be specific to a task: | |
80 | ||
81 | pid == 0: if the pid parameter is zero, the counter is attached to the | |
82 | current task. | |
83 | ||
84 | pid > 0: the counter is attached to a specific task (if the current task | |
85 | has sufficient privilege to do so) | |
86 | ||
87 | pid < 0: all tasks are counted (per cpu counters) | |
88 | ||
89 | The 'cpu' parameter allows a counter to be made specific to a full | |
90 | CPU: | |
91 | ||
92 | cpu >= 0: the counter is restricted to a specific CPU | |
93 | cpu == -1: the counter counts on all CPUs | |
94 | ||
95 | Note: the combination of 'pid == -1' and 'cpu == -1' is not valid. | |
96 | ||
97 | A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts | |
98 | events of that task and 'follows' that task to whatever CPU the task | |
99 | gets scheduled to. Per task counters can be created by any user, for | |
100 | their own tasks. | |
101 | ||
102 | A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts | |
103 | all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege. | |
104 |
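The 'pid' and 'cpu' rules above can be summarized as a small classification routine. This is a sketch of one reading of the text, not kernel code: enum counter_scope, classify_counter() and needs_cap_sys_admin() are hypothetical names, treating 'cpu < -1' as invalid is an assumption (the text only defines -1 and values >= 0), and so is treating 'pid >= 0' with 'cpu >= 0' as a per task counter restricted to one CPU.

```c
#include <sys/types.h>

/* Hypothetical classification of a (pid, cpu) pair. */
enum counter_scope {
	SCOPE_INVALID,		/* rejected combination */
	SCOPE_PER_TASK,		/* follows the task across CPUs */
	SCOPE_PER_TASK_ON_CPU,	/* one task, restricted to one CPU */
	SCOPE_PER_CPU,		/* all tasks on one CPU */
};

static enum counter_scope classify_counter(pid_t pid, int cpu)
{
	/* Assumption: only -1 and values >= 0 are meaningful for cpu. */
	if (cpu < -1)
		return SCOPE_INVALID;

	/* pid < 0: all tasks are counted, so a specific CPU is required. */
	if (pid < 0)
		return cpu == -1 ? SCOPE_INVALID : SCOPE_PER_CPU;

	/* pid == 0 (current task) or pid > 0 (a specific task). */
	return cpu == -1 ? SCOPE_PER_TASK : SCOPE_PER_TASK_ON_CPU;
}

/* Per CPU counters are the ones described as needing CAP_SYS_ADMIN. */
static int needs_cap_sys_admin(enum counter_scope scope)
{
	return scope == SCOPE_PER_CPU;
}
```

For example, classify_counter(-1, -1) gives SCOPE_INVALID, matching the note that this combination is not valid, while classify_counter(0, -1) describes a counter that follows the current task.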