src/doc/rustc-dev-guide/src/profiling.md

# Profiling the compiler

This section explains how to profile the compiler and find out where it spends its time.

Depending on what you're trying to measure, there are several different approaches:

- If you want to see if a PR improves or regresses compiler performance:
  - The [rustc-perf](https://github.com/rust-lang/rustc-perf) project makes this easy and can be triggered to run on a PR via the `@rust-timer` bot.

- If you want a medium-to-high level overview of where `rustc` is spending its time:
  - The `-Z self-profile` flag and [measureme](https://github.com/rust-lang/measureme) tools offer a query-based approach to profiling.
    See [their docs](https://github.com/rust-lang/measureme/blob/master/summarize/Readme.md) for more information.

- If you want function-level performance data or even just more details than the above approaches:
  - Consider using a native code profiler such as [perf](profiling/with_perf.html)
  - or [tracy](https://github.com/nagisa/rust_tracy_client) for a nanosecond-precision,
    full-featured graphical interface.

- If you want a nice visual representation of the compile times of your crate graph,
  you can use [cargo's `-Z timings` flag](https://doc.rust-lang.org/cargo/reference/unstable.html#timings),
  e.g. `cargo -Z timings build`.
  You can use this flag on the compiler itself with `CARGOFLAGS="-Z timings" ./x.py build`.

- If you want to profile memory usage, you can use various tools depending on what operating system
  you are using.
  - For Windows, read our [WPA guide](profiling/wpa_profiling.html).

## Optimizing rustc's bootstrap times with `cargo-llvm-lines`

Using [cargo-llvm-lines](https://github.com/dtolnay/cargo-llvm-lines) you can count the
number of lines of LLVM IR across all instantiations of a generic function.
Since most of the time compiling rustc is spent in LLVM, the idea is that by
reducing the amount of code passed to LLVM, compiling rustc gets faster.

To use `cargo-llvm-lines` together with rustc's somewhat custom build process, you can use
`-C save-temps` to obtain the required LLVM IR. The option preserves temporary work products
created during compilation. Among those is the LLVM IR that is the input to the
optimization pipeline, which is ideal for our purposes. It is stored in LLVM bitcode format
in files with the `*.no-opt.bc` extension.

Example usage:
```
cargo install cargo-llvm-lines
# On a normal crate you could now run `cargo llvm-lines`, but x.py isn't normal :P

# Do a clean before every run, to not mix in the results from previous runs.
./x.py clean
env RUSTFLAGS=-Csave-temps ./x.py build --stage 0 compiler/rustc

# Single crate, e.g., rustc_middle. (Relies on the glob support of your shell.)
# Convert unoptimized LLVM bitcode into a human readable LLVM assembly accepted by cargo-llvm-lines.
for f in build/x86_64-unknown-linux-gnu/stage0-rustc/x86_64-unknown-linux-gnu/release/deps/rustc_middle-*.no-opt.bc; do
  ./build/x86_64-unknown-linux-gnu/llvm/bin/llvm-dis "$f"
done
cargo llvm-lines --files ./build/x86_64-unknown-linux-gnu/stage0-rustc/x86_64-unknown-linux-gnu/release/deps/rustc_middle-*.ll > llvm-lines-middle.txt

# Specify all crates of the compiler.
for f in build/x86_64-unknown-linux-gnu/stage0-rustc/x86_64-unknown-linux-gnu/release/deps/*.no-opt.bc; do
  ./build/x86_64-unknown-linux-gnu/llvm/bin/llvm-dis "$f"
done
cargo llvm-lines --files ./build/x86_64-unknown-linux-gnu/stage0-rustc/x86_64-unknown-linux-gnu/release/deps/*.ll > llvm-lines.txt
```

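The paths above hard-code the `x86_64-unknown-linux-gnu` triple. On another host, a small sketch like the following may help derive the directory instead. It assumes that the host and target triples are the same and that the build directory follows the layout shown above; `parse_host` and `deps_dir` are hypothetical helper names, not part of `x.py`.

```shell
# Hypothetical helper: extract the host triple from `rustc -vV` output on stdin.
parse_host() {
  sed -n 's/^host: //p'
}

# Hypothetical helper: build the stage0 deps directory for a given triple,
# assuming host == target and the directory layout used in the example above.
deps_dir() {
  printf 'build/%s/stage0-rustc/%s/release/deps\n' "$1" "$1"
}

# Usage (assumes `rustc` is on PATH):
#   host=$(rustc -vV | parse_host)
#   ls "$(deps_dir "$host")"/rustc_middle-*.no-opt.bc
```
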
Example output for the compiler:
```
  Lines            Copies          Function name
  -----            ------          -------------
  45207720 (100%)  1583774 (100%)  (TOTAL)
   2102350 (4.7%)   146650 (9.3%)  core::ptr::drop_in_place
    615080 (1.4%)     8392 (0.5%)  std::thread::local::LocalKey<T>::try_with
    594296 (1.3%)     1780 (0.1%)  hashbrown::raw::RawTable<T>::rehash_in_place
    592071 (1.3%)     9691 (0.6%)  core::option::Option<T>::map
    528172 (1.2%)     5741 (0.4%)  core::alloc::layout::Layout::array
    466854 (1.0%)     8863 (0.6%)  core::ptr::swap_nonoverlapping_one
    412736 (0.9%)     1780 (0.1%)  hashbrown::raw::RawTable<T>::resize
    367776 (0.8%)     2554 (0.2%)  alloc::raw_vec::RawVec<T,A>::grow_amortized
    367507 (0.8%)      643 (0.0%)  rustc_query_system::dep_graph::graph::DepGraph<K>::with_task_impl
    355882 (0.8%)     6332 (0.4%)  alloc::alloc::box_free
    354556 (0.8%)    14213 (0.9%)  core::ptr::write
    354361 (0.8%)     3590 (0.2%)  core::iter::traits::iterator::Iterator::fold
    347761 (0.8%)     3873 (0.2%)  rustc_middle::ty::context::tls::set_tlv
    337534 (0.7%)     2377 (0.2%)  alloc::raw_vec::RawVec<T,A>::allocate_in
    331690 (0.7%)     3192 (0.2%)  hashbrown::raw::RawTable<T>::find
    328756 (0.7%)     3978 (0.3%)  rustc_middle::ty::context::tls::with_context_opt
    326903 (0.7%)      642 (0.0%)  rustc_query_system::query::plumbing::try_execute_query
```

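The report is ordered by line count, but the `Copies` column can be just as interesting when looking for heavily instantiated generics. As a rough sketch, the report could be re-sorted with standard tools; `top_by_copies` is a hypothetical helper name, and the column layout is assumed to match the example output above.

```shell
# Hypothetical helper: list the N most-instantiated functions from a
# `cargo llvm-lines` report (file arguments, or stdin if none are given).
top_by_copies() {
  n=$1; shift
  awk '
    # Data rows start with a line count; skip the header and the (TOTAL) row.
    $1 ~ /^[0-9]+$/ && $5 != "(TOTAL)" {
      # Function names may contain spaces, so rejoin fields 5..NF.
      name = $5
      for (i = 6; i <= NF; i++) name = name " " $i
      print $3, name
    }
  ' "$@" | sort -k1,1nr | head -n "$n"
}

# Usage: top_by_copies 10 llvm-lines.txt
```
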
Since this doesn't seem to work with incremental compilation or `x.py check`,
you will be compiling rustc _a lot_.
I recommend changing a few settings in `config.toml` to make it bearable:
```
[rust]
# A debug build takes _a third_ as long on my machine,
# but compiling more than stage0 rustc becomes unbearably slow.
optimize = false

# We can't use incremental anyway, so we disable it for a little speed boost.
incremental = false
# We won't be running it, so no point in compiling debug checks.
debug = false

# Using a single codegen unit gives less output, but is slower to compile.
codegen-units = 0  # num_cpus
```

The llvm-lines output is affected by several options.
`optimize = false` increases it from 2.1GB to 3.5GB, and `codegen-units = 0` increases it to 4.1GB.

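To quantify how a setting changes the amount of generated IR, one option is to compare the `(TOTAL)` rows of two reports produced with different settings. The following is a minimal sketch under that assumption; `total_lines` and `total_change` are hypothetical helper names, and the report file names in the usage comment are placeholders.

```shell
# Hypothetical helper: print the line count from the (TOTAL) row of a report.
total_lines() {
  awk '$NF == "(TOTAL)" { print $1; exit }' "$1"
}

# Hypothetical helper: print the relative change in total lines
# going from report $1 to report $2, as a percentage.
total_change() {
  before=$(total_lines "$1")
  after=$(total_lines "$2")
  awk -v b="$before" -v a="$after" \
    'BEGIN { printf "%.1f%%\n", (a - b) * 100 / b }'
}

# Usage: total_change llvm-lines-before.txt llvm-lines-after.txt
```
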
MIR optimizations have little impact. Compared to the default
`RUSTFLAGS="-Z mir-opt-level=1"`, level 0 adds 0.3GB and level 2 removes 0.2GB.
As of <!-- date: 2021-01 --> January 2021, inlining only happens in
LLVM, but this might change in the future.