]>
Commit | Line | Data |
---|---|---|
a1dfa0c6 XL |
1 | # Profiling with perf |
2 | ||
3 | This is a guide for how to profile rustc with [perf](https://perf.wiki.kernel.org/index.php/Main_Page). | |
4 | ||
5 | ## Initial steps | |
6 | ||
7 | - Get a clean checkout of rust-lang/master, or whatever it is you want | |
8 | to profile. | |
9 | - Set the following settings in your `config.toml`: | |
dc9dc135 | 10 | - `debuginfo-level = 1` - enables line debuginfo |
ba9703b0 | 11 | - `jemalloc = false` - lets you do memory use profiling with valgrind |
a1dfa0c6 XL |
12 | - leave everything else the defaults |
13 | - Run `./x.py build` to get a full build | |
14 | - Make a rustup toolchain pointing to that result | |
15 | - see [the "build and run" section for instructions][b-a-r] | |
16 | ||
60c5eb7d | 17 | [b-a-r]: ../building/how-to-build-and-run.html#toolchain |
dc9dc135 | 18 | |
a1dfa0c6 XL |
19 | ## Gathering a perf profile |
20 | ||
21 | perf is an excellent tool on linux that can be used to gather and | |
22 | analyze all kinds of information. Mostly it is used to figure out | |
23 | where a program spends its time. It can also be used for other sorts | |
24 | of events, though, like cache misses and so forth. | |
25 | ||
26 | ### The basics | |
27 | ||
28 | The basic `perf` command is this: | |
29 | ||
30 | ```bash | |
60c5eb7d | 31 | perf record -F99 --call-graph dwarf XXX |
a1dfa0c6 XL |
32 | ``` |
33 | ||
34 | The `-F99` tells perf to sample at 99 Hz, which avoids generating too | |
35 | much data for longer runs (why 99 Hz you ask? It is often chosen | |
36 | because it is unlikely to be in lockstep with other periodic | |
37 | activity). The `--call-graph dwarf` tells perf to get call-graph | |
38 | information from debuginfo, which is accurate. The `XXX` is the | |
39 | command you want to profile. So, for example, you might do: | |
40 | ||
41 | ```bash | |
60c5eb7d | 42 | perf record -F99 --call-graph dwarf cargo +<toolchain> rustc |
a1dfa0c6 XL |
43 | ``` |
44 | ||
45 | to run `cargo` -- here `<toolchain>` should be the name of the toolchain | |
46 | you made in the beginning. But there are some things to be aware of: | |
47 | ||
48 | - You probably don't want to profile the time spend building | |
49 | dependencies. So something like `cargo build; cargo clean -p $C` may | |
50 | be helpful (where `$C` is the crate name) | |
51 | - Though usually I just do `touch src/lib.rs` and rebuild instead. =) | |
52 | - You probably don't want incremental messing about with your | |
53 | profile. So something like `CARGO_INCREMENTAL=0` can be helpful. | |
54 | ||
55 | ### Gathering a perf profile from a `perf.rust-lang.org` test | |
56 | ||
57 | Often we want to analyze a specific test from `perf.rust-lang.org`. To | |
58 | do that, the first step is to clone | |
59 | [the rustc-perf repository][rustc-perf-gh]: | |
60 | ||
61 | ```bash | |
ba9703b0 | 62 | git clone https://github.com/rust-lang/rustc-perf |
a1dfa0c6 XL |
63 | ``` |
64 | ||
ba9703b0 | 65 | [rustc-perf-gh]: https://github.com/rust-lang/rustc-perf |
a1dfa0c6 XL |
66 | |
67 | #### Doing it the easy way | |
68 | ||
69 | Once you've cloned the repo, you can use the `collector` executable to | |
70 | do profiling for you! You can find | |
71 | [instructions in the rustc-perf readme][rustc-perf-readme]. | |
72 | ||
ba9703b0 | 73 | [rustc-perf-readme]: https://github.com/rust-lang/rustc-perf/blob/master/collector/README.md#profiling |
a1dfa0c6 XL |
74 | |
75 | For example, to measure the clap-rs test, you might do: | |
76 | ||
77 | ```bash | |
60c5eb7d XL |
78 | ./target/release/collector \ |
79 | --output-repo /path/to/place/output \ | |
80 | profile perf-record \ | |
81 | --rustc /path/to/rustc/executable/from/your/build/directory \ | |
82 | --cargo `which cargo` \ | |
83 | --filter clap-rs \ | |
84 | --builds Check \ | |
a1dfa0c6 XL |
85 | ``` |
86 | ||
87 | You can also use that same command to use cachegrind or other profiling tools. | |
88 | ||
89 | #### Doing it the hard way | |
90 | ||
91 | If you prefer to run things manually, that is also possible. You first | |
92 | need to find the source for the test you want. Sources for the tests | |
93 | are found in [the `collector/benchmarks` directory][dir]. So let's go | |
94 | into the directory of a specific test; we'll use `clap-rs` as an | |
95 | example: | |
96 | ||
ba9703b0 | 97 | [dir]: https://github.com/rust-lang/rustc-perf/tree/master/collector/benchmarks |
a1dfa0c6 XL |
98 | |
99 | ```bash | |
60c5eb7d | 100 | cd collector/benchmarks/clap-rs |
a1dfa0c6 XL |
101 | ``` |
102 | ||
103 | In this case, let's say we want to profile the `cargo check` | |
104 | performance. In that case, I would first run some basic commands to | |
105 | build the dependencies: | |
106 | ||
107 | ```bash | |
108 | # Setup: first clean out any old results and build the dependencies: | |
60c5eb7d XL |
109 | cargo +<toolchain> clean |
110 | CARGO_INCREMENTAL=0 cargo +<toolchain> check | |
a1dfa0c6 XL |
111 | ``` |
112 | ||
113 | (Again, `<toolchain>` should be replaced with the name of the | |
114 | toolchain we made in the first step.) | |
115 | ||
116 | Next: we want record the execution time for *just* the clap-rs crate, | |
117 | running cargo check. I tend to use `cargo rustc` for this, since it | |
118 | also allows me to add explicit flags, which we'll do later on. | |
119 | ||
120 | ```bash | |
60c5eb7d XL |
121 | touch src/lib.rs |
122 | CARGO_INCREMENTAL=0 perf record -F99 --call-graph dwarf cargo rustc --profile check --lib | |
a1dfa0c6 XL |
123 | ``` |
124 | ||
125 | Note that final command: it's a doozy! It uses the `cargo rustc` | |
126 | command, which executes rustc with (potentially) additional options; | |
127 | the `--profile check` and `--lib` options specify that we are doing a | |
128 | `cargo check` execution, and that this is a library (not a binary). | |
129 | ||
130 | At this point, we can use `perf` tooling to analyze the results. For example: | |
131 | ||
132 | ```bash | |
60c5eb7d | 133 | perf report |
a1dfa0c6 XL |
134 | ``` |
135 | ||
136 | will open up an interactive TUI program. In simple cases, that can be | |
137 | helpful. For more detailed examination, the [`perf-focus` tool][pf] | |
138 | can be helpful; it is covered below. | |
139 | ||
140 | **A note of caution.** Each of the rustc-perf tests is its own special | |
141 | snowflake. In particular, some of them are not libraries, in which | |
142 | case you would want to do `touch src/main.rs` and avoid passing | |
143 | `--lib`. I'm not sure how best to tell which test is which to be | |
144 | honest. | |
145 | ||
146 | ### Gathering NLL data | |
147 | ||
148 | If you want to profile an NLL run, you can just pass extra options to | |
149 | the `cargo rustc` command, like so: | |
150 | ||
151 | ```bash | |
60c5eb7d | 152 | touch src/lib.rs |
6a06907d | 153 | CARGO_INCREMENTAL=0 perf record -F99 --call-graph dwarf cargo rustc --profile check --lib -- -Z borrowck=mir |
a1dfa0c6 XL |
154 | ``` |
155 | ||
156 | [pf]: https://github.com/nikomatsakis/perf-focus | |
157 | ||
158 | ## Analyzing a perf profile with `perf focus` | |
159 | ||
160 | Once you've gathered a perf profile, we want to get some information | |
161 | about it. For this, I personally use [perf focus][pf]. It's a kind of | |
162 | simple but useful tool that lets you answer queries like: | |
163 | ||
164 | - "how much time was spent in function F" (no matter where it was called from) | |
165 | - "how much time was spent in function F when it was called from G" | |
166 | - "how much time was spent in function F *excluding* time spent in G" | |
167 | - "what functions does F call and how much time does it spend in them" | |
168 | ||
169 | To understand how it works, you have to know just a bit about | |
170 | perf. Basically, perf works by *sampling* your process on a regular | |
171 | basis (or whenever some event occurs). For each sample, perf gathers a | |
172 | backtrace. `perf focus` lets you write a regular expression that tests | |
173 | which functions appear in that backtrace, and then tells you which | |
174 | percentage of samples had a backtrace that met the regular | |
175 | expression. It's probably easiest to explain by walking through how I | |
176 | would analyze NLL performance. | |
177 | ||
178 | ### Installing `perf-focus` | |
179 | ||
180 | You can install perf-focus using `cargo install`: | |
181 | ||
182 | ```bash | |
60c5eb7d | 183 | cargo install perf-focus |
a1dfa0c6 XL |
184 | ``` |
185 | ||
186 | ### Example: How much time is spent in MIR borrowck? | |
187 | ||
188 | Let's say we've gathered the NLL data for a test. We'd like to know | |
189 | how much time it is spending in the MIR borrow-checker. The "main" | |
190 | function of the MIR borrowck is called `do_mir_borrowck`, so we can do | |
191 | this command: | |
192 | ||
193 | ```bash | |
60c5eb7d | 194 | $ perf focus '{do_mir_borrowck}' |
a1dfa0c6 XL |
195 | Matcher : {do_mir_borrowck} |
196 | Matches : 228 | |
197 | Not Matches: 542 | |
198 | Percentage : 29% | |
199 | ``` | |
200 | ||
201 | The `'{do_mir_borrowck}'` argument is called the **matcher**. It | |
202 | specifies the test to be applied on the backtrace. In this case, the | |
203 | `{X}` indicates that there must be *some* function on the backtrace | |
204 | that meets the regular expression `X`. In this case, that regex is | |
205 | just the name of the function we want (in fact, it's a subset of the name; | |
206 | the full name includes a bunch of other stuff, like the module | |
207 | path). In this mode, perf-focus just prints out the percentage of | |
208 | samples where `do_mir_borrowck` was on the stack: in this case, 29%. | |
209 | ||
210 | **A note about c++filt.** To get the data from `perf`, `perf focus` | |
211 | currently executes `perf script` (perhaps there is a better | |
212 | way...). I've sometimes found that `perf script` outputs C++ mangled | |
213 | names. This is annoying. You can tell by running `perf script | | |
214 | head` yourself — if you see names like `5rustc6middle` instead of | |
215 | `rustc::middle`, then you have the same problem. You can solve this | |
216 | by doing: | |
217 | ||
218 | ```bash | |
60c5eb7d | 219 | perf script | c++filt | perf focus --from-stdin ... |
a1dfa0c6 XL |
220 | ``` |
221 | ||
222 | This will pipe the output from `perf script` through `c++filt` and | |
223 | should mostly convert those names into a more friendly format. The | |
224 | `--from-stdin` flag to `perf focus` tells it to get its data from | |
225 | stdin, rather than executing `perf focus`. We should make this more | |
226 | convenient (at worst, maybe add a `c++filt` option to `perf focus`, or | |
227 | just always use it — it's pretty harmless). | |
228 | ||
229 | ### Example: How much time does MIR borrowck spend solving traits? | |
230 | ||
231 | Perhaps we'd like to know how much time MIR borrowck spends in the | |
232 | trait checker. We can ask this using a more complex regex: | |
233 | ||
234 | ```bash | |
60c5eb7d | 235 | $ perf focus '{do_mir_borrowck}..{^rustc::traits}' |
a1dfa0c6 XL |
236 | Matcher : {do_mir_borrowck},..{^rustc::traits} |
237 | Matches : 12 | |
238 | Not Matches: 1311 | |
239 | Percentage : 0% | |
240 | ``` | |
241 | ||
242 | Here we used the `..` operator to ask "how often do we have | |
243 | `do_mir_borrowck` on the stack and then, later, some function whose | |
6a06907d | 244 | name begins with `rustc::traits`?" (basically, code in that module). It |
a1dfa0c6 XL |
245 | turns out the answer is "almost never" — only 12 samples fit that |
246 | description (if you ever see *no* samples, that often indicates your | |
247 | query is messed up). | |
248 | ||
249 | If you're curious, you can find out exactly which samples by using the | |
250 | `--print-match` option. This will print out the full backtrace for | |
251 | each sample. The `|` at the front of the line indicates the part that | |
252 | the regular expression matched. | |
253 | ||
254 | ### Example: Where does MIR borrowck spend its time? | |
255 | ||
256 | Often we want to do a more "explorational" queries. Like, we know that | |
257 | MIR borrowck is 29% of the time, but where does that time get spent? | |
258 | For that, the `--tree-callees` option is often the best tool. You | |
259 | usually also want to give `--tree-min-percent` or | |
260 | `--tree-max-depth`. The result looks like this: | |
261 | ||
262 | ```bash | |
60c5eb7d | 263 | $ perf focus '{do_mir_borrowck}' --tree-callees --tree-min-percent 3 |
a1dfa0c6 XL |
264 | Matcher : {do_mir_borrowck} |
265 | Matches : 577 | |
266 | Not Matches: 746 | |
267 | Percentage : 43% | |
268 | ||
269 | Tree | |
270 | | matched `{do_mir_borrowck}` (43% total, 0% self) | |
271 | : | rustc_mir::borrow_check::nll::compute_regions (20% total, 0% self) | |
272 | : : | rustc_mir::borrow_check::nll::type_check::type_check_internal (13% total, 0% self) | |
273 | : : : | core::ops::function::FnOnce::call_once (5% total, 0% self) | |
274 | : : : : | rustc_mir::borrow_check::nll::type_check::liveness::generate (5% total, 3% self) | |
dc9dc135 | 275 | : : : | <rustc_mir::borrow_check::nll::type_check::TypeVerifier<'a, 'b, 'tcx> as rustc::mir::visit::Visitor<'tcx>>::visit_mir (3% total, 0% self) |
a1dfa0c6 | 276 | : | rustc::mir::visit::Visitor::visit_mir (8% total, 6% self) |
dc9dc135 | 277 | : | <rustc_mir::borrow_check::MirBorrowckCtxt<'cx, 'tcx> as rustc_mir::dataflow::DataflowResultsConsumer<'cx, 'tcx>>::visit_statement_entry (5% total, 0% self) |
a1dfa0c6 XL |
278 | : | rustc_mir::dataflow::do_dataflow (3% total, 0% self) |
279 | ``` | |
280 | ||
281 | What happens with `--tree-callees` is that | |
282 | ||
283 | - we find each sample matching the regular expression | |
284 | - we look at the code that is occurs *after* the regex match and try | |
285 | to build up a call tree | |
286 | ||
287 | The `--tree-min-percent 3` option says "only show me things that take | |
288 | more than 3% of the time. Without this, the tree often gets really | |
289 | noisy and includes random stuff like the innards of | |
290 | malloc. `--tree-max-depth` can be useful too, it just limits how many | |
291 | levels we print. | |
292 | ||
293 | For each line, we display the percent of time in that function | |
294 | altogether ("total") and the percent of time spent in **just that | |
295 | function and not some callee of that function** (self). Usually | |
296 | "total" is the more interesting number, but not always. | |
297 | ||
dc9dc135 | 298 | ### Relative percentages |
a1dfa0c6 XL |
299 | |
300 | By default, all in perf-focus are relative to the **total program | |
301 | execution**. This is useful to help you keep perspective — often as | |
302 | we drill down to find hot spots, we can lose sight of the fact that, | |
303 | in terms of overall program execution, this "hot spot" is actually not | |
304 | important. It also ensures that percentages between different queries | |
305 | are easily compared against one another. | |
306 | ||
307 | That said, sometimes it's useful to get relative percentages, so `perf | |
308 | focus` offers a `--relative` option. In this case, the percentages are | |
309 | listed only for samples that match (vs all samples). So for example we | |
310 | could get our percentages relative to the borrowck itself | |
311 | like so: | |
312 | ||
313 | ```bash | |
60c5eb7d | 314 | $ perf focus '{do_mir_borrowck}' --tree-callees --relative --tree-max-depth 1 --tree-min-percent 5 |
a1dfa0c6 XL |
315 | Matcher : {do_mir_borrowck} |
316 | Matches : 577 | |
317 | Not Matches: 746 | |
318 | Percentage : 100% | |
319 | ||
320 | Tree | |
321 | | matched `{do_mir_borrowck}` (100% total, 0% self) | |
322 | : | rustc_mir::borrow_check::nll::compute_regions (47% total, 0% self) [...] | |
323 | : | rustc::mir::visit::Visitor::visit_mir (19% total, 15% self) [...] | |
dc9dc135 | 324 | : | <rustc_mir::borrow_check::MirBorrowckCtxt<'cx, 'tcx> as rustc_mir::dataflow::DataflowResultsConsumer<'cx, 'tcx>>::visit_statement_entry (13% total, 0% self) [...] |
a1dfa0c6 XL |
325 | : | rustc_mir::dataflow::do_dataflow (8% total, 1% self) [...] |
326 | ``` | |
327 | ||
328 | Here you see that `compute_regions` came up as "47% total" — that | |
329 | means that 47% of `do_mir_borrowck` is spent in that function. Before, | |
330 | we saw 20% — that's because `do_mir_borrowck` itself is only 43% of | |
331 | the total time (and `.47 * .43 = .20`). |