===================================
Supporting PMUs on RISC-V platforms
===================================

Alan Kao <alankao@andestech.com>, Mar 2018

Introduction
------------

As of this writing, perf_event-related features mentioned in The RISC-V ISA
Privileged Version 1.10 are as follows
(please check the manual for more details):

* [m|s]counteren
* mcycle[h], cycle[h]
* minstret[h], instret[h]
* mhpeventx, mhpcounterx[h]

With only the features listed above, porting perf requires a lot of work,
because the following general architectural performance monitoring features
are missing:

* Enabling/Disabling counters
  Counters are just free-running all the time in our case.
* Interrupt caused by counter overflow
  No such feature in the spec.
* Interrupt indicator
  It is not possible to have many interrupt ports for all counters, so an
  interrupt indicator is required for software to tell which counter has
  just overflowed.
* Writing to counters
  There will be an SBI to support this since the kernel cannot modify the
  counters [1]. Alternatively, some vendors are considering implementing a
  hardware extension that lets M-S-U model machines write counters directly.

This document aims to give developers a quick guide to supporting their
PMUs in the kernel. The following sections briefly explain perf's mechanism
and the remaining to-dos.

You may check previous discussions here [1][2]. Also, it might be helpful
to check the appendix for related kernel structures.


1. Initialization
-----------------

*riscv_pmu* is a global pointer of type *struct riscv_pmu*, which contains
various methods according to perf's internal convention and PMU-specific
parameters. One should declare such an instance to represent the PMU. By default,
*riscv_pmu* points to a constant structure *riscv_base_pmu*, which provides very
basic support for a baseline QEMU model.

A developer can then either assign the instance's pointer to *riscv_pmu* so that
the minimal and already-implemented logic can be leveraged, or write their
own *riscv_init_platform_pmu* implementation.

In other words, the existing sources of *riscv_base_pmu* merely provide a
reference implementation. Developers can decide how much of it to reuse,
and in the most extreme case, they can customize every function
according to their needs.


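The pattern described in this section, a global pointer that defaults to a
reference implementation and can be redirected to a platform-specific one,
might look roughly like the sketch below. *struct riscv_pmu_sketch*, its
fields and *init_platform_pmu* are simplified, hypothetical stand-ins and do
not reproduce the layout of the kernel's *struct riscv_pmu*.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct riscv_pmu: one method and one
 * PMU-specific parameter are enough to show the idea. */
struct riscv_pmu_sketch {
	const char *name;
	int (*map_hw_event)(unsigned long config);	/* event -> hardware code */
	int num_counters;				/* PMU-specific parameter */
};

static int base_map_hw_event(unsigned long config)
{
	/* The baseline model here knows only two events: 0 and 1. */
	return config < 2 ? (int)config : -1;
}

/* Reference implementation, similar in role to riscv_base_pmu. */
static const struct riscv_pmu_sketch base_pmu = {
	.name = "base",
	.map_hw_event = base_map_hw_event,
	.num_counters = 2,
};

/* Global pointer used by the rest of the code; defaults to the base PMU. */
static const struct riscv_pmu_sketch *pmu_sketch = &base_pmu;

/* A vendor PMU that reuses the base mapping but has more counters. */
static const struct riscv_pmu_sketch vendor_pmu = {
	.name = "vendor",
	.map_hw_event = base_map_hw_event,
	.num_counters = 8,
};

/* Hypothetical platform hook: passing NULL keeps the default. */
static void init_platform_pmu(const struct riscv_pmu_sketch *platform)
{
	if (platform != NULL)
		pmu_sketch = platform;
}
```

Here the vendor instance reuses one method from the reference implementation
and overrides only a parameter, which is the "leverage what already exists"
end of the spectrum; a fully custom PMU would supply its own methods as well.
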
2. Event Initialization
-----------------------

When a user launches a perf command to monitor some events, the userspace
perf tool first translates it into multiple *perf_event_open* system calls,
and each of them then calls the *event_init* member function that was
assigned in the previous step. In *riscv_base_pmu*'s case, it is
*riscv_event_init*.

The main purpose of this function is to translate the event provided by the
user into a bitmap, so that HW-related control registers or counters can be
manipulated directly. The translation is based on the mappings and methods
provided in *riscv_pmu*.

Note that some features can be done in this stage as well:

(1) interrupt setting, which is stated in the next section;
(2) privilege level setting (user space only, kernel space only, both);
(3) destructor setting. Normally it is sufficient to apply *riscv_destroy_event*;
(4) tweaks for non-sampling events, which will be utilized by functions such as
    *perf_adjust_period*, usually something like the following::

      if (!is_sampling_event(event)) {
              hwc->sample_period = x86_pmu.max_period;
              hwc->last_period = hwc->sample_period;
              local64_set(&hwc->period_left, hwc->sample_period);
      }

In the case of *riscv_base_pmu*, only (3) is provided for now.


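The translation step of *event_init* can be illustrated with a table lookup.
The sketch below is an assumption-laden simplification: the event ids, the
raw codes in *hw_event_map*, and *struct event_sketch* are all made up for
illustration and are not the kernel's definitions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical generic event ids, standing in for perf's generic events. */
enum { EV_CPU_CYCLES, EV_INSTRUCTIONS, EV_CACHE_MISSES, EV_MAX };

/* Hypothetical raw hardware codes; -1 marks an unsupported event. */
static const int hw_event_map[EV_MAX] = {
	[EV_CPU_CYCLES]   = 0x0,
	[EV_INSTRUCTIONS] = 0x2,
	[EV_CACHE_MISSES] = -1,
};

/* Minimal stand-in for struct perf_event plus its hardware part. */
struct event_sketch {
	unsigned int attr_config;	/* what the user asked for */
	int hw_config;			/* what the hardware understands */
	void (*destroy)(struct event_sketch *event);
};

static int destroyed_count;

static void destroy_event_sketch(struct event_sketch *event)
{
	(void)event;
	destroyed_count++;	/* a real destructor would release resources */
}

/* Translate the user's request, or fail with a negative errno value. */
static int event_init_sketch(struct event_sketch *event)
{
	int code;

	if (event->attr_config >= (unsigned int)EV_MAX)
		return -ENOENT;

	code = hw_event_map[event->attr_config];
	if (code < 0)
		return -EOPNOTSUPP;

	event->hw_config = code;
	event->destroy = destroy_event_sketch;	/* destructor setting, item (3) */
	return 0;
}
```

Rejecting unknown or unsupported events at this stage means that later
callbacks can assume a valid hardware configuration.
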
3. Interrupt
------------

3.1. Interrupt Initialization

This often occurs at the beginning of the *event_init* method. In common
practice, this should be a code segment like::

  int x86_reserve_hardware(void)
  {
          int err = 0;

          if (!atomic_inc_not_zero(&pmc_refcount)) {
                  mutex_lock(&pmc_reserve_mutex);
                  if (atomic_read(&pmc_refcount) == 0) {
                          if (!reserve_pmc_hardware())
                                  err = -EBUSY;
                          else
                                  reserve_ds_buffers();
                  }
                  if (!err)
                          atomic_inc(&pmc_refcount);
                  mutex_unlock(&pmc_reserve_mutex);
          }

          return err;
  }

And the magic is in *reserve_pmc_hardware*, which usually does atomic
operations to make the implemented IRQ accessible from some global function
pointer. *release_pmc_hardware* serves the opposite purpose, and it is used in
the event destructors mentioned in the previous section.

(Note: In the implementations of all the architectures, the *reserve/release*
pair always deals with IRQ settings, so the name *pmc_hardware* is somewhat
misleading. It does NOT deal with the binding between an event and a physical
counter, which will be introduced in the next section.)

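To make the reserve/release pairing concrete, here is a deliberately
single-threaded simplification. It leaves out the atomics and the mutex that
a real implementation needs, and every name ending in *_sketch*, as well as
*active_irq_handler*, is hypothetical.

```c
#include <errno.h>
#include <stddef.h>

typedef void (*irq_handler_fn)(void);

/* Global function pointer through which the PMU IRQ handler is reached. */
static irq_handler_fn active_irq_handler;
static int pmc_refcount_sketch;		/* number of events using the handler */

static void pmu_irq_sketch(void)
{
	/* overflow handling would go here */
}

/* Returns nonzero on success, like reserve_pmc_hardware in the text. */
static int reserve_pmc_hardware_sketch(void)
{
	if (active_irq_handler != NULL)
		return 0;		/* the IRQ is already owned */
	active_irq_handler = pmu_irq_sketch;
	return 1;
}

static void release_pmc_hardware_sketch(void)
{
	active_irq_handler = NULL;
}

/* Called from event_init: only the first user installs the handler. */
static int reserve_hardware_sketch(void)
{
	if (pmc_refcount_sketch == 0 && !reserve_pmc_hardware_sketch())
		return -EBUSY;
	pmc_refcount_sketch++;
	return 0;
}

/* Called from the event destructor: only the last user removes it. */
static void release_hardware_sketch(void)
{
	if (--pmc_refcount_sketch == 0)
		release_pmc_hardware_sketch();
}
```

The reference count is what lets many events share one IRQ setup: the
handler stays installed until the last event's destructor has run.
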
3.2. IRQ Structure

Basically, an IRQ handler runs the following pseudo code::

  for each hardware counter that triggered this overflow

      get the event of this counter

      // following two steps are defined as *read()*,
      // check the section Reading/Writing Counters for details.
      count the delta value since previous interrupt
      update the event->count (# event occurs) by adding delta, and
                 event->hw.period_left by subtracting delta

      if the event overflows
          sample data
          set the counter appropriately for the next overflow

          if the event overflows again
              too frequently, throttle this event
          fi
      fi

  end for

However, as of this writing, none of the RISC-V implementations have designed an
interrupt for perf, so the details are to be completed in the future.

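Since no RISC-V implementation provides such an interrupt yet, the following
is only a host-side sketch of the pseudo code, not a driver. It assumes a
hypothetical overflow bitmask as the interrupt indicator, takes the per-counter
deltas as input instead of reading hardware, and uses "still overdue after
rearming" as a made-up throttling criterion.

```c
#include <stdint.h>

#define NUM_COUNTERS_SKETCH 4

struct irq_event_sketch {
	uint64_t count;		/* total occurrences seen so far */
	int64_t period_left;	/* occurrences left until the next sample */
	uint64_t sample_period;
	unsigned int samples;	/* number of samples taken */
	int throttled;
};

/* events[i] is the event bound to counter i; deltas[i] is the number of new
 * occurrences that a read of counter i would report. */
static void pmu_irq_handler_sketch(unsigned int overflow_mask,
				   struct irq_event_sketch *events,
				   const uint64_t *deltas)
{
	for (int i = 0; i < NUM_COUNTERS_SKETCH; i++) {
		struct irq_event_sketch *ev = &events[i];

		if (!(overflow_mask & (1u << i)))
			continue;	/* this counter did not overflow */

		/* The read() step: fold the delta into the software state. */
		ev->count += deltas[i];
		ev->period_left -= (int64_t)deltas[i];

		if (ev->period_left <= 0) {
			ev->samples++;		/* "sample data" */
			ev->period_left += (int64_t)ev->sample_period;

			/* Overflowed again right away: treat as too frequent. */
			if (ev->period_left <= 0) {
				ev->throttled = 1;
				ev->period_left = (int64_t)ev->sample_period;
			}
		}
	}
}
```

A real handler would also reprogram the hardware counter after updating
*period_left*; that part depends on how counters can be written.
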
4. Reading/Writing Counters
---------------------------

They seem symmetric but perf treats them quite differently. For reading, there
is a *read* interface in *struct pmu*, but it serves more than just reading.
Depending on the context, the *read* function not only reads the content of the
counter (event->count), but also updates the period left until the next
interrupt (event->hw.period_left).

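A minimal sketch of such a *read* helper is shown below. The 48-bit counter
width, the struct, and the function name are assumptions made for
illustration; the point is the delta computation and the two fields it
updates.

```c
#include <stdint.h>

#define COUNTER_BITS_SKETCH 48	/* hypothetical counter width */
#define COUNTER_MASK_SKETCH ((UINT64_C(1) << COUNTER_BITS_SKETCH) - 1)

struct read_event_sketch {
	uint64_t prev_raw;	/* raw counter value at the previous read */
	uint64_t count;		/* plays the role of event->count */
	int64_t period_left;	/* plays the role of event->hw.period_left */
};

/* Fold the occurrences since the previous read into the software state and
 * return how many there were. */
static uint64_t event_update_sketch(struct read_event_sketch *ev,
				    uint64_t new_raw)
{
	/* Masking keeps the result correct if the counter wrapped once. */
	uint64_t delta = (new_raw - ev->prev_raw) & COUNTER_MASK_SKETCH;

	ev->prev_raw = new_raw;
	ev->count += delta;
	ev->period_left -= (int64_t)delta;
	return delta;
}
```

Because the subtraction is done modulo the counter width, a raw value that is
numerically smaller than the previous one still yields the right delta, as
long as the counter wrapped at most once between two reads.
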
But the core of perf does not need to write to counters directly. Writing
counters is hidden behind two abstractions: 1) *pmu->start*, which literally
starts counting, so one has to set the counter to a suitable value for the next
interrupt; 2) the IRQ handler, which should set the counter to the same
reasonable value.

Reading is not a problem in RISC-V but writing would need some effort, since
S-mode is not allowed to write counters.


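One common way to choose that "suitable value" is to preload the counter with
the negated remaining period, so that it wraps after exactly that many events.
The sketch below assumes a 48-bit counter that raises its interrupt on wrap,
and a hypothetical *write_counter* hook standing in for whatever mechanism
(an SBI call or a vendor extension) ends up performing the write.

```c
#include <stdint.h>

#define CNT_BITS_SKETCH 48	/* hypothetical counter width */
#define CNT_MASK_SKETCH ((UINT64_C(1) << CNT_BITS_SKETCH) - 1)

/* Hypothetical hook for the actual write to hardware counter idx. */
typedef void (*write_counter_fn)(int idx, uint64_t value);

/* Value to preload so that the counter wraps to zero after `left` more
 * events. `left` must be nonzero and smaller than 2^48. */
static uint64_t preload_value_sketch(uint64_t left)
{
	return (UINT64_C(0) - left) & CNT_MASK_SKETCH;
}

/* Start a new period if the old one is used up, then program the counter. */
static void set_period_sketch(int idx, int64_t *period_left,
			      uint64_t sample_period,
			      write_counter_fn write_counter)
{
	if (*period_left <= 0)
		*period_left += (int64_t)sample_period;
	if (*period_left <= 0)		/* still overdue: restart cleanly */
		*period_left = (int64_t)sample_period;

	write_counter(idx, preload_value_sketch((uint64_t)*period_left));
}

/* Simulated counters, used only to exercise the function. */
static uint64_t sim_counters[4];

static void sim_write_counter(int idx, uint64_t value)
{
	sim_counters[idx] = value;
}
```

Both *start* with a reload request and the IRQ handler could share a helper
of this shape, which keeps the two write paths consistent.
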
5. add()/del()/start()/stop()
-----------------------------

Basic idea: add()/del() adds/deletes events to/from a PMU, and start()/stop()
starts/stops the counter of some event in the PMU. All of them take the same
arguments: *struct perf_event *event* and *int flag*.

Consider perf as a state machine, and you will find that these functions serve
as the transitions between its states.
Three states (event->hw.state) are defined:

* PERF_HES_STOPPED:   the counter is stopped
* PERF_HES_UPTODATE:  the event->count is up-to-date
* PERF_HES_ARCH:      arch-dependent usage ... we don't need this for now

A normal flow of these state transitions is as follows:

* A user launches a perf event, resulting in a call to *event_init*.
* When being context-switched in, *add* is called by the perf core, with a flag
  PERF_EF_START, which means that the event should be started after it is added.
  At this stage, a general event is bound to a physical counter, if any.
  The state changes to PERF_HES_STOPPED and PERF_HES_UPTODATE, because it is now
  stopped, and the (software) event count does not need updating.

  - *start* is then called, and the counter is enabled.
    With flag PERF_EF_RELOAD, it writes an appropriate value to the counter (check
    the previous section for details).
    Nothing is written if the flag does not contain PERF_EF_RELOAD.
    The state is now reset to none, because it is neither stopped nor up to date
    (the counting has already started).

* When being context-switched out, *del* is called. It then checks out all the
  events in the PMU and calls *stop* to update their counts.

  - *stop* is called by *del*
    and the perf core with flag PERF_EF_UPDATE, and it often shares the same
    subroutine as *read* with the same logic.
    The state changes to PERF_HES_STOPPED and PERF_HES_UPTODATE, again.

  - Life cycle of these two pairs: *add* and *del* are called repeatedly as
    tasks switch in and out; *start* and *stop* are also called when the perf core
    needs a quick stop-and-start, for instance, when the interrupt period is being
    adjusted.

The current implementation is sufficient for now and can be easily extended
with new features in the future.

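The state transitions described in this section can be modeled on the host
with a few lines of C. This is a sketch under stated assumptions, not the
kernel code: the *SK_* flag and state names are local to the example, and the
"hardware" is a plain struct field that the caller advances by hand.

```c
#include <stdint.h>

/* Flags local to this sketch, in the roles of PERF_EF_START/RELOAD/UPDATE. */
#define SK_EF_START	0x01
#define SK_EF_RELOAD	0x02
#define SK_EF_UPDATE	0x04

/* State bits in the roles of PERF_HES_STOPPED and PERF_HES_UPTODATE. */
#define SK_HES_STOPPED	0x01
#define SK_HES_UPTODATE	0x02

struct sm_event_sketch {
	int state;		/* plays the role of event->hw.state */
	int counter_running;	/* simulated hardware enable bit */
	uint64_t hw_counter;	/* simulated hardware counter */
	uint64_t prev_raw;	/* raw value at the last update */
	uint64_t count;		/* plays the role of event->count */
};

static void sm_start(struct sm_event_sketch *ev, int flags)
{
	if (flags & SK_EF_RELOAD)
		ev->hw_counter = ev->prev_raw = 0;	/* write a start value */
	ev->state = 0;		/* neither stopped nor up to date */
	ev->counter_running = 1;
}

static void sm_stop(struct sm_event_sketch *ev, int flags)
{
	if (!(ev->state & SK_HES_STOPPED)) {
		ev->counter_running = 0;
		ev->state |= SK_HES_STOPPED;
	}
	if ((flags & SK_EF_UPDATE) && !(ev->state & SK_HES_UPTODATE)) {
		ev->count += ev->hw_counter - ev->prev_raw;	/* as in read */
		ev->prev_raw = ev->hw_counter;
		ev->state |= SK_HES_UPTODATE;
	}
}

static void sm_add(struct sm_event_sketch *ev, int flags)
{
	/* Binding the event to a physical counter would happen here. */
	ev->state = SK_HES_STOPPED | SK_HES_UPTODATE;
	if (flags & SK_EF_START)
		sm_start(ev, SK_EF_RELOAD);
}

static void sm_del(struct sm_event_sketch *ev)
{
	sm_stop(ev, SK_EF_UPDATE);
	/* Releasing the physical counter would happen here. */
}
```

Calling *sm_add* with the start flag, advancing *hw_counter*, and then calling
*sm_del* walks the event from "stopped and up to date" through "running" and
back, with the elapsed events folded into *count* on the way out.
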
A. Related Structures
---------------------

* struct pmu: include/linux/perf_event.h
* struct riscv_pmu: arch/riscv/include/asm/perf_event.h

  Both structures are designed to be read-only.

  *struct pmu* defines some function pointer interfaces, and most of them take
  *struct perf_event* as a main argument, dealing with perf events according to
  perf's internal state machine (check kernel/events/core.c for details).

  *struct riscv_pmu* defines PMU-specific parameters. The naming follows the
  convention of all other architectures.

* struct perf_event: include/linux/perf_event.h
* struct hw_perf_event

  The generic structure that represents perf events, and the hardware-related
  details.

* struct riscv_hw_events: arch/riscv/include/asm/perf_event.h

  The structure that holds the status of events. It has two fixed members:
  the number of events and the array of the events.

References
----------

[1] https://github.com/riscv/riscv-linux/pull/124

[2] https://groups.google.com/a/groups.riscv.org/forum/#!topic/sw-dev/f19TmCNP6yA