New upstream version 1.55.0+dfsg1
# Overview of the Compiler

<!-- toc -->

This chapter is about the overall process of compiling a program -- how
everything fits together.

The Rust compiler is special in two ways: it does things to your code that
other compilers don't do (e.g. borrow checking) and it has a lot of
unconventional implementation choices (e.g. queries). We will talk about these
in turn in this chapter, and in the rest of the guide, we will look at all the
individual pieces in more detail.

## What the compiler does to your code

So first, let's look at what the compiler does to your code. For now, we will
avoid mentioning how the compiler implements these steps except as needed;
we'll talk about that later.

- The compile process begins when a user writes a Rust source program in text
  and invokes the `rustc` compiler on it. The work that the compiler needs to
  perform is defined by command-line options. For example, it is possible to
  enable nightly features (`-Z` flags), perform `check`-only builds, or emit
  LLVM-IR rather than executable machine code. The `rustc` executable call may
  be indirect through the use of `cargo`.
- Command line argument parsing occurs in the [`rustc_driver`]. This crate
  defines the compile configuration that is requested by the user and passes it
  to the rest of the compilation process as a [`rustc_interface::Config`].
- The raw Rust source text is analyzed by a low-level lexer located in
  [`rustc_lexer`]. At this stage, the source text is turned into a stream of
  atomic source code units known as _tokens_. The lexer supports the
  Unicode character encoding.
- The token stream passes through a higher-level lexer located in
  [`rustc_parse`] to prepare for the next stage of the compile process. The
  [`StringReader`] struct is used at this stage to perform a set of validations
  and turn strings into interned symbols (_interning_ is discussed later).
  [String interning] is a way of storing only one immutable
  copy of each distinct string value.

- The lexer has a small interface and doesn't depend directly on the
  diagnostic infrastructure in `rustc`. Instead it provides diagnostics as plain
  data which are emitted in `rustc_parse::lexer::mod` as real diagnostics.
- The lexer preserves full fidelity information for both IDEs and proc macros.
- The parser [translates the token stream from the lexer into an Abstract Syntax
  Tree (AST)][parser]. It uses a recursive descent (top-down) approach to syntax
  analysis. The crate entry points for the parser are the `Parser::parse_crate_mod()` and
  `Parser::parse_mod()` methods found in `rustc_parse::parser::item`. The external
  module parsing entry point is `rustc_expand::module::parse_external_mod`. And
  the macro parser entry point is [`Parser::parse_nonterminal()`][parse_nonterminal].
- Parsing is performed with a set of `Parser` utility methods including `fn bump`,
  `fn check`, `fn eat`, `fn expect`, `fn look_ahead`.
- Parsing is organized by the semantic construct that is being parsed. Separate
  `parse_*` methods can be found in the `rustc_parse` `parser` directory. The source
  file name follows the construct name. For example, the following files are found
  in the parser:
  - `expr.rs`
  - `pat.rs`
  - `ty.rs`
  - `stmt.rs`
- This naming scheme is used across many compiler stages. You will find
  either a file or directory with the same name across the parsing, lowering,
  type checking, THIR lowering, and MIR building sources.
- Macro expansion, AST validation, name resolution, and early linting take place
  during this stage of the compile process.
- The parser uses the standard `DiagnosticBuilder` API for error handling, but we
  try to recover, parsing a superset of Rust's grammar, while also emitting an error.
- `rustc_ast::ast::{Crate, Mod, Expr, Pat, ...}` AST nodes are returned from the parser.
- We then take the AST and [convert it to High-Level Intermediate
  Representation (HIR)][hir]. This is a compiler-friendly representation of the
  AST. This involves a lot of desugaring of things like loops and `async fn`.
- We use the HIR to do [type inference] (the process of automatic
  detection of the type of an expression), [trait solving] (the process
  of pairing up an impl with each reference to a trait), and [type
  checking] (the process of converting the types found in the HIR
  (`hir::Ty`), which represent the syntactic things that the user wrote,
  into the internal representation used by the compiler (`Ty<'tcx>`),
  and using that information to verify the type safety, correctness and
  coherence of the types used in the program).
- The HIR is then [lowered to Mid-Level Intermediate Representation (MIR)][mir].
  - Along the way, we construct the THIR, which is an even more desugared HIR.
    THIR is used for pattern and exhaustiveness checking. It is also more
    convenient to convert into MIR than HIR is.
- The MIR is used for [borrow checking].
- We (want to) do [many optimizations on the MIR][mir-opt] because it is still
  generic and that improves the code we generate later, improving compilation
  speed too.
  - MIR is a higher level (and generic) representation, so it is easier to do
    some optimizations at MIR level than at LLVM-IR level. For example LLVM
    doesn't seem to be able to optimize the pattern the [`simplify_try`] mir
    opt looks for.
- Rust code is _monomorphized_, which means making copies of all the generic
  code with the type parameters replaced by concrete types. To do
  this, we need to collect a list of what concrete types to generate code for.
  This is called _monomorphization collection_.
- We then begin what is vaguely called _code generation_ or _codegen_.
  - The [code generation stage (codegen)][codegen] is when higher level
    representations of source are turned into an executable binary. `rustc`
    uses LLVM for code generation. The first step is to convert the MIR
    to LLVM Intermediate Representation (LLVM IR). This is where the MIR
    is actually monomorphized, according to the list we created in the
    previous step.
  - The LLVM IR is passed to LLVM, which does a lot more optimizations on it.
    LLVM IR is basically assembly code with additional low-level types and
    annotations added. LLVM then emits machine code (e.g. an ELF object or wasm).
  - The different libraries/binaries are linked together to produce the final
    binary.

[String interning]: https://en.wikipedia.org/wiki/String_interning
[`rustc_lexer`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html
[`rustc_driver`]: https://rustc-dev-guide.rust-lang.org/rustc-driver.html
[`rustc_interface::Config`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_interface/interface/struct.Config.html
[lex]: https://rustc-dev-guide.rust-lang.org/the-parser.html
[`StringReader`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/lexer/struct.StringReader.html
[`rustc_parse`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html
[parser]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html
[hir]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/index.html
[type inference]: https://rustc-dev-guide.rust-lang.org/type-inference.html
[trait solving]: https://rustc-dev-guide.rust-lang.org/traits/resolution.html
[type checking]: https://rustc-dev-guide.rust-lang.org/type-checking.html
[mir]: https://rustc-dev-guide.rust-lang.org/mir/index.html
[borrow checking]: https://rustc-dev-guide.rust-lang.org/borrow_check.html
[mir-opt]: https://rustc-dev-guide.rust-lang.org/mir/optimizations.html
[`simplify_try`]: https://github.com/rust-lang/rust/pull/66282
[codegen]: https://rustc-dev-guide.rust-lang.org/backend/codegen.html
[parse_nonterminal]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/struct.Parser.html#method.parse_nonterminal

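To make the lexing and parsing steps in the list above more concrete, here is a minimal sketch of a lexer feeding a recursive descent parser for a toy arithmetic grammar. The method names `bump`, `check`, `eat`, and `expect` are borrowed from the utility methods mentioned above, but every type and method body here is hypothetical and far simpler than what `rustc_lexer` and `rustc_parse` actually do (there is no interning, no error recovery, and no spans).

```rust
// Toy grammar (hypothetical, far smaller than Rust's):
//   expr := term ('+' term)*
//   term := atom ('*' atom)*
//   atom := NUM | '(' expr ')'

#[derive(Debug, Clone, PartialEq)]
enum Token { Num(i64), Plus, Star, LParen, RParen, Eof }

#[derive(Debug, PartialEq)]
enum Expr { Num(i64), Add(Box<Expr>, Box<Expr>), Mul(Box<Expr>, Box<Expr>) }

/// Lexing: turn raw text into a flat stream of tokens, ending with `Eof`.
fn lex(src: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        let simple = match c {
            '+' => Some(Token::Plus),
            '*' => Some(Token::Star),
            '(' => Some(Token::LParen),
            ')' => Some(Token::RParen),
            _ => None,
        };
        if let Some(tok) = simple {
            tokens.push(tok);
            chars.next();
        } else if c.is_whitespace() {
            chars.next();
        } else if c.is_ascii_digit() {
            // Accumulate a run of decimal digits into one number token.
            let mut n: i64 = 0;
            while let Some(d) = chars.peek().and_then(|d| d.to_digit(10)) {
                n = n * 10 + i64::from(d);
                chars.next();
            }
            tokens.push(Token::Num(n));
        } else {
            return Err(format!("unexpected character {:?}", c));
        }
    }
    tokens.push(Token::Eof);
    Ok(tokens)
}

struct Parser { tokens: Vec<Token>, pos: usize }

impl Parser {
    /// Look at the current token without consuming it.
    fn peek(&self) -> &Token { &self.tokens[self.pos] }
    /// Is the current token equal to `tok`?
    fn check(&self, tok: &Token) -> bool { self.peek() == tok }
    /// Advance to the next token (never moves past the trailing `Eof`).
    fn bump(&mut self) {
        if self.pos + 1 < self.tokens.len() { self.pos += 1; }
    }
    /// Consume the current token if it is `tok`; report whether it was.
    fn eat(&mut self, tok: &Token) -> bool {
        let hit = self.check(tok);
        if hit { self.bump(); }
        hit
    }
    /// Like `eat`, but a mismatch is an error.
    fn expect(&mut self, tok: &Token) -> Result<(), String> {
        if self.eat(tok) { return Ok(()); }
        Err(format!("expected {:?}, found {:?}", tok, self.peek()))
    }
    // One `parse_*` method per grammar rule: this is the recursive descent.
    fn parse_expr(&mut self) -> Result<Expr, String> {
        let mut lhs = self.parse_term()?;
        while self.eat(&Token::Plus) {
            lhs = Expr::Add(Box::new(lhs), Box::new(self.parse_term()?));
        }
        Ok(lhs)
    }
    fn parse_term(&mut self) -> Result<Expr, String> {
        let mut lhs = self.parse_atom()?;
        while self.eat(&Token::Star) {
            lhs = Expr::Mul(Box::new(lhs), Box::new(self.parse_atom()?));
        }
        Ok(lhs)
    }
    fn parse_atom(&mut self) -> Result<Expr, String> {
        match self.peek().clone() {
            Token::Num(n) => { self.bump(); Ok(Expr::Num(n)) }
            Token::LParen => {
                self.bump();
                let inner = self.parse_expr()?;
                self.expect(&Token::RParen)?;
                Ok(inner)
            }
            other => Err(format!("expected an expression, found {:?}", other)),
        }
    }
}

/// Entry point: lex, parse one expression, and require that nothing is left over.
fn parse(src: &str) -> Result<Expr, String> {
    let mut parser = Parser { tokens: lex(src)?, pos: 0 };
    let expr = parser.parse_expr()?;
    parser.expect(&Token::Eof)?;
    Ok(expr)
}
```

Calling `parse("1 + 2 * 3")` yields an `Add` node whose right child is a `Mul` node, because `parse_expr` delegates to `parse_term` for the tighter-binding operator.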
## How it does it

Ok, so now that we have a high-level view of what the compiler does to your
code, let's take a high-level view of _how_ it does all that stuff. There are a
lot of constraints and conflicting goals that the compiler needs to
satisfy/optimize for. For example,

- Compilation speed: how fast is it to compile a program? More/better
  compile-time analyses often mean compilation is slower.
  - Also, we want to support incremental compilation, so we need to take that
    into account. How can we keep track of what work needs to be redone and
    what can be reused if the user modifies their program?
    - Also we can't store too much stuff in the incremental cache because
      it would take a long time to load from disk and it could take a lot
      of space on the user's system...
- Compiler memory usage: while compiling a program, we don't want to use more
  memory than we need.
- Program speed: how fast is your compiled program? More/better compile-time
  analyses often mean the compiler can do better optimizations.
- Program size: how large is the compiled binary? Similar to the previous
  point.
- Compiler compilation speed: how long does it take to compile the compiler?
  This impacts contributors and compiler maintenance.
- Implementation complexity: building a compiler is one of the hardest
  things a person/group can do, and Rust is not a very simple language, so how
  do we make the compiler's code base manageable?
- Compiler correctness: the binaries produced by the compiler should do what
  the input programs say they do, and should continue to do so despite the
  tremendous amount of change constantly going on.
- Integration: a number of other tools need to use the compiler in
  various ways (e.g. cargo, clippy, miri, RLS) that must be supported.
- Compiler stability: the compiler should not crash or fail ungracefully on the
  stable channel.
- Rust stability: the compiler must respect Rust's stability guarantees by not
  breaking programs that previously compiled despite the many changes that are
  always going on to its implementation.
- Limitations of other tools: rustc uses LLVM in its backend, and LLVM has some
  strengths we leverage and some limitations/weaknesses we need to work around.

So, as you read through the rest of the guide, keep these things in mind. They
will often inform decisions that we make.

### Intermediate representations

As with most compilers, `rustc` uses some intermediate representations (IRs) to
facilitate computations. In general, working directly with the source code is
extremely inconvenient and error-prone. Source code is designed to be human-friendly while at
the same time being unambiguous, but it's less convenient for doing something
like, say, type checking.

Instead most compilers, including `rustc`, build some sort of IR out of the
source code which is easier to analyze. `rustc` has a few IRs, each optimized
for different purposes:

- Token stream: the lexer produces a stream of tokens directly from the source
  code. This stream of tokens is easier for the parser to deal with than raw
  text.
- Abstract Syntax Tree (AST): the abstract syntax tree is built from the stream
  of tokens produced by the lexer. It represents
  pretty much exactly what the user wrote. It helps to do some syntactic sanity
  checking (e.g. checking that a type is expected where the user wrote one).
- High-level IR (HIR): This is a sort of desugared AST. It's still close
  to what the user wrote syntactically, but it includes some implicit things
  such as some elided lifetimes, etc. This IR is amenable to type checking.
- Typed HIR (THIR): This is an intermediate between HIR and MIR, and used to be called
  High-level Abstract IR (HAIR). It is like the HIR but it is fully typed and a bit
  more desugared (e.g. method calls and implicit dereferences are made fully explicit).
  Moreover, it is easier to lower to MIR from THIR than from HIR.
- Middle-level IR (MIR): This IR is basically a Control-Flow Graph (CFG). A CFG
  is a type of diagram that shows the basic blocks of a program and how control
  flow can go between them. Likewise, MIR also has a bunch of basic blocks with
  simple typed statements inside them (e.g. assignment, simple computations,
  etc) and control flow edges to other basic blocks (e.g., calls, dropping
  values). MIR is used for borrow checking and other
  important dataflow-based checks, such as checking for uninitialized values.
  It is also used for a series of optimizations and for constant evaluation (via
  MIRI). Because MIR is still generic, we can do a lot of analyses here more
  efficiently than after monomorphization.
- LLVM IR: This is the standard form of all input to the LLVM compiler. LLVM IR
  is a sort of typed assembly language with lots of annotations. It's
  a standard format that is used by all compilers that use LLVM (e.g. the clang
  C compiler also outputs LLVM IR). LLVM IR is designed to be easy for other
  compilers to emit and also rich enough for LLVM to run a bunch of
  optimizations on it.

One other thing to note is that many values in the compiler are _interned_.
This is a performance and memory optimization in which we allocate the values
in a special allocator called an _arena_. Then, we pass around references to
the values allocated in the arena. This allows us to make sure that identical
values (e.g. types in your program) are only allocated once and can be compared
cheaply by comparing pointers. Many of the intermediate representations are
interned.

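As a rough illustration of interning, the sketch below stores each distinct string once and hands out a small `Copy` handle for it. It is an assumption-laden toy: the names `Interner` and `Symbol` are chosen for this example, and it uses indices into a `Vec` instead of arena pointers to stay in safe Rust, so the cheap comparison is an integer comparison instead of the pointer comparison described above.

```rust
use std::collections::HashMap;

/// Handle to an interned string: a small `Copy` index into the interner's table.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Symbol(u32);

/// Hypothetical string interner; not the compiler's implementation.
#[derive(Default)]
struct Interner {
    /// Maps a string to the handle it was given the first time it was seen.
    lookup: HashMap<String, Symbol>,
    /// Maps a handle back to its string (index = `Symbol.0`).
    strings: Vec<String>,
}

impl Interner {
    /// Return the existing handle for `s`, or allocate a new one.
    fn intern(&mut self, s: &str) -> Symbol {
        if let Some(&sym) = self.lookup.get(s) {
            return sym;
        }
        let sym = Symbol(self.strings.len() as u32);
        // For simplicity the text is stored twice (table and map key);
        // an arena-backed interner would avoid that duplication.
        self.strings.push(s.to_owned());
        self.lookup.insert(s.to_owned(), sym);
        sym
    }

    /// Look up the text behind a handle.
    fn resolve(&self, sym: Symbol) -> &str {
        &self.strings[sym.0 as usize]
    }
}
```

Interning `"foo"` twice returns the same `Symbol`, so later equality checks never need to look at the string contents.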
### Queries

The first big implementation choice is the _query_ system. The Rust compiler
uses a query system, unlike most textbook compilers, which are
organized as a series of passes over the code that execute sequentially. The
compiler does this to make incremental compilation possible -- that is, if the
user makes a change to their program and recompiles, we want to do as little
redundant work as possible to produce the new binary.

In `rustc`, all the major steps above are organized as a bunch of queries that
call each other. For example, there is a query to ask for the type of something
and another to ask for the optimized MIR of a function. These
queries can call each other and are all tracked through the query system.
The results of the queries are cached on disk so that we can tell which
queries' results changed from the last compilation and only redo those. This is
how incremental compilation works.

In principle, for the query-fied steps, we do each of the above for each item
individually. For example, we will take the HIR for a function and use queries
to ask for the LLVM IR for that HIR. This drives the generation of optimized
MIR, which drives the borrow checker, which drives the generation of MIR, and
so on.

... except that this is very over-simplified. In fact, some queries are not
cached on disk, and some parts of the compiler have to run for all code anyway
for correctness even if the code is dead code (e.g. the borrow checker). For
example, [currently the `mir_borrowck` query is first executed on all functions
of a crate.][passes] Then the codegen backend invokes the
`collect_and_partition_mono_items` query, which first recursively requests the
`optimized_mir` for all reachable functions, which in turn runs `mir_borrowck`
for that function and then creates codegen units. This kind of split will need
to remain to ensure that unreachable functions still have their errors emitted.

[passes]: https://github.com/rust-lang/rust/blob/45ebd5808afd3df7ba842797c0fcd4447ddf30fb/src/librustc_interface/passes.rs#L824

Moreover, the compiler wasn't originally built to use a query system; the query
system has been retrofitted into the compiler, so parts of it are not query-fied
yet. Also, LLVM isn't our code, so that isn't query-fied either. The plan is to
eventually query-fy all of the steps listed in the previous section,
but as of <!-- date: 2021-02 --> February 2021, only the steps between HIR and
LLVM IR are query-fied. That is, lexing, parsing, name resolution, and macro
expansion are done all at once for the whole program.

One other thing to mention here is the all-important "typing context",
[`TyCtxt`], which is a giant struct that is at the center of all things.
(Note that the name is mostly historic. This is _not_ a "typing context" in the
sense of `Γ` or `Δ` from type theory. The name is retained because that's what
the name of the struct is in the source code.) All
queries are defined as methods on the [`TyCtxt`] type, and the in-memory query
cache is stored there too. In the code, there is usually a variable called
`tcx` which is a handle on the typing context. You will also see lifetimes with
the name `'tcx`, which means that something is tied to the lifetime of the
`TyCtxt` (usually it is stored or interned there).

[`TyCtxt`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.TyCtxt.html

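The demand-driven shape of the query system can be sketched with a small memoizing context. This is only an illustration under stated assumptions: `Ctxt` is a hypothetical stand-in for `TyCtxt`, `build_mir` is an invented helper name, the query results are plain strings, and there is no on-disk cache, dependency tracking, or cycle detection. The names `mir_borrowck` and `optimized_mir` are taken from the text above, but their bodies here are placeholders.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Hypothetical stand-in for `TyCtxt`: holds the inputs and the in-memory query cache.
struct Ctxt {
    /// Input: function name -> source text.
    sources: HashMap<&'static str, &'static str>,
    /// Memoized results, keyed by (query name, function name).
    cache: RefCell<HashMap<(&'static str, &'static str), String>>,
    /// Order in which query bodies actually ran (cache hits are not logged).
    log: RefCell<Vec<String>>,
}

impl Ctxt {
    fn new(sources: HashMap<&'static str, &'static str>) -> Self {
        Ctxt {
            sources,
            cache: RefCell::new(HashMap::new()),
            log: RefCell::new(Vec::new()),
        }
    }

    /// Shared plumbing: return the cached result, or run `compute` once and cache it.
    fn query(
        &self,
        name: &'static str,
        key: &'static str,
        compute: impl FnOnce() -> String,
    ) -> String {
        if let Some(hit) = self.cache.borrow().get(&(name, key)) {
            return hit.clone();
        }
        self.log.borrow_mut().push(format!("{}({})", name, key));
        let value = compute();
        self.cache.borrow_mut().insert((name, key), value.clone());
        value
    }

    /// Placeholder for MIR construction (the name is an assumption of this sketch).
    fn build_mir(&self, f: &'static str) -> String {
        self.query("build_mir", f, || {
            format!("mir[{}]", self.sources.get(f).copied().unwrap_or("<missing>"))
        })
    }

    /// Placeholder borrow check: it requests the built MIR on demand.
    fn mir_borrowck(&self, f: &'static str) -> String {
        self.query("mir_borrowck", f, || format!("checked {}", self.build_mir(f)))
    }

    /// Placeholder optimization: it requests the borrow-checked MIR on demand.
    fn optimized_mir(&self, f: &'static str) -> String {
        self.query("optimized_mir", f, || format!("optimized {}", self.mir_borrowck(f)))
    }
}
```

Asking for `optimized_mir` of a function runs the three query bodies in the order `optimized_mir`, `mir_borrowck`, `build_mir`; asking again hits the cache and runs nothing.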
### `ty::Ty`

Types are really important in Rust, and they form the core of a lot of compiler
analyses. The main type (in the compiler) that represents types (in the user's
program) is [`rustc_middle::ty::Ty`][ty]. This is so important that we have a whole chapter
on [`ty::Ty`][ty], but for now, we just want to mention that it exists and is the way
`rustc` represents types!

Also note that the `rustc_middle::ty` module defines the `TyCtxt` struct we mentioned before.

[ty]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/type.Ty.html

### Parallelism

Compiler performance is a problem that we would like to improve on
(and are always working on). One aspect of that is parallelizing
`rustc` itself.

Currently, there is only one part of rustc that is already parallel: codegen.
During monomorphization, the compiler will split up all the code to be
generated into smaller chunks called _codegen units_. These are then generated
by independent instances of LLVM. Since they are independent, we can run them
in parallel. At the end, the linker is run to combine all the codegen units
together into one binary.

However, the rest of the compiler is not yet parallel. A lot of effort has
been spent on this, but it is generally a hard problem. The current
approach is to turn `RefCell`s into `Mutex`es -- that is, we
switch to thread-safe internal mutability. However, there are ongoing
challenges with lock contention, maintaining query-system invariants under
concurrency, and the complexity of the code base. One can try out the current
work by enabling parallel compilation in `config.toml`. It's still early days,
but there are already some promising performance improvements.

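The two ideas in this section, independent codegen units processed in parallel and `Mutex`-based interior mutability for shared state, can be sketched with scoped threads from the standard library. Everything here is hypothetical: `CodegenUnit`, `codegen`, `codegen_all`, and `link` are toy names, the "object files" are strings, and spawning one thread per unit is a simplification.

```rust
use std::sync::Mutex;
use std::thread;

/// Hypothetical codegen unit: a name plus the items assigned to it.
struct CodegenUnit {
    name: String,
    items: Vec<String>,
}

/// Stand-in for handing one unit to an independent backend instance.
fn codegen(unit: &CodegenUnit) -> String {
    format!("{}.o[{}]", unit.name, unit.items.join(","))
}

/// Generate every unit on its own thread and collect the results in unit order.
fn codegen_all(units: &[CodegenUnit]) -> Vec<String> {
    // Shared result table. A `RefCell` could not be shared across threads,
    // so the thread-safe `Mutex` is used instead.
    let objects = Mutex::new(vec![String::new(); units.len()]);
    thread::scope(|s| {
        for (i, unit) in units.iter().enumerate() {
            let objects = &objects;
            s.spawn(move || {
                // The expensive work happens outside the lock...
                let object = codegen(unit);
                // ...and the lock is held only to store the result.
                let mut table = objects.lock().unwrap();
                table[i] = object;
            });
        }
    });
    // All threads have been joined by the time `scope` returns.
    objects.into_inner().unwrap()
}

/// Stand-in for the final link step that combines all objects.
fn link(objects: &[String]) -> String {
    objects.join(" + ")
}
```

Keeping the lock scope small is one way to limit the lock contention mentioned above; the units never need to talk to each other, which is what makes this stage easy to parallelize.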
### Bootstrapping

`rustc` itself is written in Rust. So how do we compile the compiler? We use an
older compiler to compile the newer compiler. This is called [_bootstrapping_].

Bootstrapping has a lot of interesting implications. For example, it means
that one of the major users of Rust is the Rust compiler, so we are
constantly testing our own software ("eating our own dogfood").

For more details on bootstrapping, see
[the bootstrapping section of the guide][rustc-bootstrap].

[_bootstrapping_]: https://en.wikipedia.org/wiki/Bootstrapping_(compilers)
[rustc-bootstrap]: building/bootstrapping.md

# Unresolved Questions

- Does LLVM ever do optimizations in debug builds?
- How do I explore phases of the compile process in my own sources (lexer,
  parser, HIR, etc)? - e.g., `cargo rustc -- -Z unpretty=hir-tree` allows you to
  view HIR representation
- What is the main source entry point for `X`?
- Where do phases diverge for cross-compilation to machine code across
  different platforms?

# References

- Command line parsing
  - Guide: [The Rustc Driver and Interface](https://rustc-dev-guide.rust-lang.org/rustc-driver.html)
  - Driver definition: [`rustc_driver`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/)
  - Main entry point: [`rustc_session::config::build_session_options`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_session/config/fn.build_session_options.html)
- Lexical Analysis: Lex the user program to a stream of tokens
  - Guide: [Lexing and Parsing](https://rustc-dev-guide.rust-lang.org/the-parser.html)
  - Lexer definition: [`rustc_lexer`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html)
  - Main entry point: [`rustc_lexer::first_token`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/fn.first_token.html)
- Parsing: Parse the stream of tokens to an Abstract Syntax Tree (AST)
  - Guide: [Lexing and Parsing](https://rustc-dev-guide.rust-lang.org/the-parser.html)
  - Parser definition: [`rustc_parse`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html)
  - Main entry points:
    - [Entry point for first file in crate](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_interface/passes/fn.parse.html)
    - [Entry point for outline module parsing](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_expand/module/fn.parse_external_mod.html)
    - [Entry point for macro fragments][parse_nonterminal]
  - AST definition: [`rustc_ast`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/ast/index.html)
  - Expansion: **TODO**
  - Name Resolution: **TODO**
  - Feature gating: **TODO**
  - Early linting: **TODO**
- The High Level Intermediate Representation (HIR)
  - Guide: [The HIR](https://rustc-dev-guide.rust-lang.org/hir.html)
  - Guide: [Identifiers in the HIR](https://rustc-dev-guide.rust-lang.org/hir.html#identifiers-in-the-hir)
  - Guide: [The HIR Map](https://rustc-dev-guide.rust-lang.org/hir.html#the-hir-map)
  - Guide: [Lowering AST to HIR](https://rustc-dev-guide.rust-lang.org/lowering.html)
  - How to view HIR representation for your code `cargo rustc -- -Z unpretty=hir-tree`
  - Rustc HIR definition: [`rustc_hir`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/index.html)
  - Main entry point: **TODO**
  - Late linting: **TODO**
- Type Inference
  - Guide: [Type Inference](https://rustc-dev-guide.rust-lang.org/type-inference.html)
  - Guide: [The ty Module: Representing Types](https://rustc-dev-guide.rust-lang.org/ty.html) (semantics)
  - Main entry point (type inference): [`InferCtxtBuilder::enter`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_infer/infer/struct.InferCtxtBuilder.html#method.enter)
  - Main entry point (type checking bodies): [the `typeck` query](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.TyCtxt.html#method.typeck)
    - These two functions can't be decoupled.
- The Mid Level Intermediate Representation (MIR)
  - Guide: [The MIR (Mid level IR)](https://rustc-dev-guide.rust-lang.org/mir/index.html)
  - Definition: [`rustc_middle/src/mir`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/mir/index.html)
  - Definition of source that manipulates the MIR: [`rustc_mir`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir/index.html)
- The Borrow Checker
  - Guide: [MIR Borrow Check](https://rustc-dev-guide.rust-lang.org/borrow_check.html)
  - Definition: [`rustc_mir/borrow_check`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir/borrow_check/index.html)
  - Main entry point: [`mir_borrowck` query](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir/borrow_check/fn.mir_borrowck.html)
- MIR Optimizations
  - Guide: [MIR Optimizations](https://rustc-dev-guide.rust-lang.org/mir/optimizations.html)
  - Definition: [`rustc_mir/transform`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir/transform/index.html)
  - Main entry point: [`optimized_mir` query](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir/transform/fn.optimized_mir.html)
- Code Generation
  - Guide: [Code Generation](https://rustc-dev-guide.rust-lang.org/backend/codegen.html)
  - Generating Machine Code from LLVM IR with LLVM - **TODO: reference?**
  - Main entry point: [`rustc_codegen_ssa::base::codegen_crate`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/base/fn.codegen_crate.html)
    - This monomorphizes and produces LLVM IR for one codegen unit. It then
      starts a background thread to run LLVM, which must be joined later.
    - Monomorphization happens lazily via [`FunctionCx::monomorphize`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/mir/struct.FunctionCx.html#method.monomorphize) and [`rustc_codegen_ssa::base::codegen_instance`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/base/fn.codegen_instance.html)