# Background topics

This section covers a number of common compiler terms that arise in
this guide. We try to give the general definition while providing some
Rust-specific context.

<a name="cfg"></a>

## What is a control-flow graph?

A control-flow graph (CFG) is a common data structure in compilers. If
you've ever used a flow-chart, then the concept of a control-flow graph
will be pretty familiar to you. It's a representation of your program
that clearly exposes the underlying control flow.

A control-flow graph is structured as a set of **basic blocks**
connected by edges. The key idea of a basic block is that it is a set
of statements that execute "together" – that is, whenever you branch
to a basic block, you start at the first statement and then execute
all the rest. Only at the end of the block is there the possibility
of branching to more than one place (in MIR, we call that final
statement the **terminator**):

```mir
bb0: {
    statement0;
    statement1;
    statement2;
    ...
    terminator;
}
```

Many expressions that you are used to in Rust compile down to multiple
basic blocks. For example, consider an `if` statement:

```rust,ignore
a = 1;
if some_variable {
    b = 1;
} else {
    c = 1;
}
d = 1;
```

This would compile into four basic blocks in MIR. In textual form, it looks like
this:

```mir
BB0: {
    a = 1;
    if some_variable {
        goto BB1;
    } else {
        goto BB2;
    }
}

BB1: {
    b = 1;
    goto BB3;
}

BB2: {
    c = 1;
    goto BB3;
}

BB3: {
    d = 1;
    ...
}
```

In graphical form, it looks like this:

```
                BB0
   +--------------------+
   | a = 1;             |
   +--------------------+
         /          \
 if some_variable   else
       /              \
 BB1  /                \  BB2
+-----------+     +-----------+
| b = 1;    |     | c = 1;    |
+-----------+     +-----------+
        \            /
         \          /
          \  BB3   /
         +----------+
         | d = 1;   |
         | ...      |
         +----------+
```

When using a control-flow graph, a loop simply appears as a cycle in
the graph, and the `break` keyword translates into a path out of that
cycle.

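For example, consider this simple loop. The basic blocks below use the same
simplified textual notation as the `if` example above; this is a sketch, not
rustc's actual MIR output, and `body`, `done`, and `after` are made-up names:

```rust,ignore
loop {
    body();
    if done {
        break;
    }
}
after();
```

```mir
BB0: {
    body();
    if done {
        goto BB1;   // `break`: the path out of the cycle
    } else {
        goto BB0;   // the back edge to BB0 is the cycle
    }
}

BB1: {
    after();
    ...
}
```
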
<a name="dataflow"></a>

## What is a dataflow analysis?

[*Static Program Analysis*](https://cs.au.dk/~amoeller/spa/) by Anders Møller
and Michael I. Schwartzbach is an incredible resource!

_Dataflow analysis_ is a type of static analysis that is common in many
compilers. It describes a general technique, rather than a particular analysis.

The basic idea is that we can walk over a [control-flow graph (CFG)](#cfg) and
keep track of what some value could be. At the end of the walk, we might have
shown that some claim is true or not necessarily true (e.g. "this variable must
be initialized"). `rustc` tends to do dataflow analyses over the MIR, since MIR
is already a CFG.

For example, suppose we want to check that `x` is initialized before it is used
in this snippet:

```rust,ignore
fn foo() {
    let mut x;

    if some_cond {
        x = 1;
    }

    dbg!(x);
}
```

A CFG for this code might look like this:

```txt
 +------+
 | Init | (A)
 +------+
    |   \
    |   if some_cond
  else   \ +-------+
    |     \| x = 1 | (B)
    |      +-------+
    |      /
 +---------+
 | dbg!(x) | (C)
 +---------+
```

We can do the dataflow analysis as follows: we will start off with a flag `init`
which indicates if we know `x` is initialized. As we walk the CFG, we will
update the flag. At the end, we can check its value.

So first, in block (A), the variable `x` is declared but not initialized, so
`init = false`. In block (B), we initialize the value, so we know that `x` is
initialized. So at the end of (B), `init = true`.

Block (C) is where things get interesting. Notice that there are two incoming
edges, one from (A) and one from (B), corresponding to whether `some_cond` is true or not.
But we cannot know that! It could be the case that `some_cond` is always true,
so that `x` is actually always initialized. It could also be the case that
`some_cond` depends on something random (e.g. the time), so `x` may not be
initialized. In general, we cannot know statically (due to [Rice's
Theorem][rice]). So what should the value of `init` be in block (C)?

[rice]: https://en.wikipedia.org/wiki/Rice%27s_theorem

Generally, in dataflow analyses, if a block has multiple parents (like (C) in
our example), its dataflow value will be some function of all its parents (and
of course, what happens in (C)). Which function we use depends on the analysis
we are doing.

In this case, we want to be able to prove definitively that `x` must be
initialized before use. This forces us to be conservative and assume that
`some_cond` might be false sometimes. So our "merging function" is "and". That
is, `init = true` in (C) if `init = true` in (A) _and_ in (B) (or if `x` is
initialized in (C)). But this is not the case; in particular, `init = false` in
(A), and `x` is not initialized in (C). Thus, `init = false` in (C); we can
report an error that "`x` may not be initialized before use".

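To make the merge function concrete, here is a minimal sketch of this "must be
initialized" analysis in Rust. This is not rustc's actual dataflow framework;
the `Block` type, the field names, and the fixpoint loop are all invented for
illustration:

```rust
/// One node of a toy CFG. `preds` holds the indices of predecessor blocks.
struct Block {
    preds: Vec<usize>,
    assigns_x: bool, // does this block itself initialize `x`?
}

/// For each block, compute whether `x` is *definitely* initialized on exit.
fn must_init(blocks: &[Block]) -> Vec<bool> {
    // Non-entry blocks start at `true`; the "and" merge can only lower them.
    let mut init = vec![true; blocks.len()];
    init[0] = blocks[0].assigns_x; // the entry block has no predecessors

    // Iterate until nothing changes (a fixpoint).
    let mut changed = true;
    while changed {
        changed = false;
        for i in 1..blocks.len() {
            // Merge function: initialized on entry only if it is
            // initialized on *every* incoming edge.
            let on_entry = blocks[i].preds.iter().all(|&p| init[p]);
            // Transfer function: the block's own assignment initializes `x`.
            let on_exit = on_entry || blocks[i].assigns_x;
            if on_exit != init[i] {
                init[i] = on_exit;
                changed = true;
            }
        }
    }
    init
}

fn main() {
    // The CFG from the example: (A) -> (B), (A) -> (C), (B) -> (C).
    let cfg = vec![
        Block { preds: vec![], assigns_x: false },     // (A): `let mut x;`
        Block { preds: vec![0], assigns_x: true },     // (B): `x = 1;`
        Block { preds: vec![0, 1], assigns_x: false }, // (C): `dbg!(x);`
    ];
    // `x` is definitely initialized at the end of (B), but only maybe in (C).
    assert_eq!(must_init(&cfg), vec![false, true, false]);
}
```

Starting the non-entry blocks at `true` and only ever lowering them means the
"and" merge converges to the most precise answer that is still safe.
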
There is definitely a lot more that can be said about dataflow analyses. There is an
extensive body of research literature on the topic, including a lot of theory.
We only discussed a forwards analysis, but backwards dataflow analysis is also
useful. For example, rather than starting from block (A) and moving forwards,
we might have started with the usage of `x` and moved backwards to try to find
its initialization.

<a name="quantified"></a>

## What is "universally quantified"? What about "existentially quantified"?

In math, a predicate may be _universally quantified_ or _existentially
quantified_:

- _Universal_ quantification:
  - the predicate holds if it is true for all possible inputs.
  - Traditional notation: ∀x: P(x). Read as "for all x, P(x) holds".
- _Existential_ quantification:
  - the predicate holds if there is any input where it is true, i.e., there
    only has to be a single input.
  - Traditional notation: ∃x: P(x). Read as "there exists x such that P(x) holds".

In Rust, they come up in type checking and trait solving. For example,

```rust,ignore
fn foo<T>()
```

This claims that the function is well-typed for all types `T`: `∀ T: well_typed(foo)`.

Another example:

```rust,ignore
fn foo<'a>(_: &'a usize)
```

This function claims that for any lifetime `'a` (determined by the
caller), it is well-typed: `∀ 'a: well_typed(foo)`.

Another example:

```rust,ignore
fn foo<F>()
where for<'a> F: Fn(&'a u8)
```

This function claims that it is well-typed for all types `F` such that for all
lifetimes `'a`, `F: Fn(&'a u8)`: `∀ F: ∀ 'a: (F: Fn(&'a u8)) => well_typed(foo)`.

One more example:

```rust,ignore
fn foo(_: dyn Debug)
```

This function claims that there exists some type `T` that implements `Debug`
such that the function is well-typed: `∃ T: (T: Debug) and well_typed(foo)`.

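Here is a runnable sketch of the universal cases above. The names `print_any`
and `apply_to_local` are invented for this example:

```rust
use std::fmt::Debug;

// Universal: the *caller* chooses `T`, so this must be well-typed for
// every `T` that meets the bound.
fn print_any<T: Debug>(value: T) {
    println!("{:?}", value);
}

// `for<'a>` is universal quantification over lifetimes: `f` must accept
// a reference of *any* lifetime, including one to a local variable that
// the caller could never have named.
fn apply_to_local<F>(f: F)
where
    for<'a> F: Fn(&'a u8),
{
    let local = 42;
    f(&local);
}

fn main() {
    print_any("hello"); // instantiated with T = &str
    print_any(1u32);    // instantiated with T = u32
    apply_to_local(|x: &u8| println!("{}", x));
}
```
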
<a name="debruijn"></a>

## What is a de Bruijn index?

[De Bruijn indices][wikideb] are a way of representing, using only integers,
which variables are bound in which binders. They were originally invented for
use in lambda calculus evaluation (see [this Wikipedia article][wikideb] for
more). In `rustc`, we use de Bruijn indices to [represent generic types][sub].

[wikideb]: https://en.wikipedia.org/wiki/De_Bruijn_index
[sub]: ../generics.md

Here is a basic example of how de Bruijn indices might be used for closures (we
don't actually do this in `rustc` though!):

```rust,ignore
|x| {
    f(x) // de Bruijn index of `x` is 1 because `x` is bound 1 level up

    |y| {
        g(x, y) // index of `x` is 2 because it is bound 2 levels up
                // index of `y` is 1 because it is bound 1 level up
    }
}
```

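To see the indices in action outside of closures, here is a small sketch of
lambda-calculus terms encoded with de Bruijn indices. This is not how `rustc`
represents binders internally; the `Term` type is invented for illustration,
and the indices are 1-based to match the closure example above:

```rust
/// A lambda-calculus term using 1-based de Bruijn indices.
#[derive(Debug)]
enum Term {
    /// `Var(n)` refers to the variable bound by the n-th enclosing lambda.
    Var(u32),
    /// A lambda is just a binder around a body; it needs no variable name.
    Lam(Box<Term>),
    /// Application of one term to another.
    App(Box<Term>, Box<Term>),
}

fn main() {
    // `|x| |y| x(y)` becomes `Lam(Lam(App(Var(2), Var(1))))`:
    // inside the inner lambda, `x` is bound 2 levels up and `y` 1 level up.
    let term = Term::Lam(Box::new(Term::Lam(Box::new(Term::App(
        Box::new(Term::Var(2)),
        Box::new(Term::Var(1)),
    )))));
    println!("{:?}", term);
}
```

A nice property of this encoding is that alpha-equivalent terms (terms that
differ only in the names of their bound variables) have exactly the same
representation.
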
<a name="variance"></a>

## What are co- and contra-variance?

Check out the subtyping chapter from the
[Rust Nomicon](https://doc.rust-lang.org/nomicon/subtyping.html).

See the [variance](../variance.md) chapter of this guide for more info on how
the type checker handles variance.

<a name="free-vs-bound"></a>

## What is a "free region" or a "free variable"? What about "bound region"?

Let's describe the concepts of free vs bound in terms of program
variables, since that's the thing we're most familiar with.

- Consider this expression, which creates a closure: `|a, b| a + b`.
  Here, the `a` and `b` in `a + b` refer to the arguments that the closure will
  be given when it is called. We say that the `a` and `b` there are **bound** to
  the closure, and that the closure signature `|a, b|` is a **binder** for the
  names `a` and `b` (because any references to `a` or `b` within refer to the
  variables that it introduces).
- Consider this expression: `a + b`. In this expression, `a` and `b` refer to
  local variables that are defined *outside* of the expression. We say that
  those variables **appear free** in the expression (i.e., they are **free**,
  not **bound** (tied up)).

So there you have it: a variable "appears free" in some
expression/statement/whatever if it refers to something defined
outside of that expression/statement/whatever. Equivalently, we can
then refer to the "free variables" of an expression – which is just
the set of variables that "appear free".

So what does this have to do with regions? Well, we can apply the
analogous concept to types and regions. For example, in the type `&'a
u32`, `'a` appears free. But in the type `for<'a> fn(&'a u32)`, it
does not.

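Here is the same distinction expressed in code (the function names are
invented for this sketch):

```rust
// `'a` is *bound* in the type `for<'a> fn(&'a u32)`: the `for<'a>`
// binder is part of the type itself, so the type stands on its own.
fn takes_fn_ptr(f: for<'a> fn(&'a u32)) {
    let local = 7;
    f(&local); // fine: `f` must accept a reference of any lifetime
}

// In the type `&'b u32`, the region `'b` appears *free*: it refers to
// a lifetime declared outside that type, on `caller` itself.
fn caller<'b>(r: &'b u32) {
    println!("{}", r);
    takes_fn_ptr(|x: &u32| println!("{}", x));
}

fn main() {
    caller(&10);
}
```
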
# Further Reading About Compilers

> Thanks to `mem`, `scottmcm`, and `Levi` on the official Discord for the
> recommendations, and to `tinaun` for posting a link to a [twitter thread from
> Graydon Hoare](https://twitter.com/graydon_pub/status/1039615569132118016)
> which had some more recommendations!
>
> Other sources: https://gcc.gnu.org/wiki/ListOfCompilerBooks
>
> If you have other suggestions, please feel free to open an issue or PR.

## Books

- [Types and Programming Languages](https://www.cis.upenn.edu/~bcpierce/tapl/)
- [Programming Language Pragmatics](https://www.cs.rochester.edu/~scott/pragmatics/)
- [Practical Foundations for Programming Languages](https://www.cs.cmu.edu/~rwh/pfpl/2nded.pdf)
- [Compilers: Principles, Techniques, and Tools, 2nd Edition](https://www.pearson.com/us/higher-education/program/Aho-Compilers-Principles-Techniques-and-Tools-2nd-Edition/PGM167067.html)
- [Garbage Collection: Algorithms for Automatic Dynamic Memory Management](https://www.cs.kent.ac.uk/people/staff/rej/gcbook/)
- [Linkers and Loaders](https://www.amazon.com/Linkers-Kaufmann-Software-Engineering-Programming/dp/1558604960) (There are also free versions of this, but the version we had linked seems to be offline at the moment.)
- [Advanced Compiler Design and Implementation](https://www.goodreads.com/book/show/887908.Advanced_Compiler_Design_and_Implementation)
- [Building an Optimizing Compiler](https://www.goodreads.com/book/show/2063103.Building_an_Optimizing_Compiler)
- [Crafting Interpreters](http://www.craftinginterpreters.com/)

## Courses

- [University of Oregon Programming Languages Summer School archive](https://www.cs.uoregon.edu/research/summerschool/archives.html)

## Wikis

- [Wikipedia](https://en.wikipedia.org/wiki/List_of_programming_languages_by_type)
- [Esoteric Programming Languages](https://esolangs.org/wiki/Main_Page)
- [Stanford Encyclopedia of Philosophy](https://plato.stanford.edu/index.html)
- [nLab](https://ncatlab.org/nlab/show/HomePage)

## Misc Papers and Blog Posts

- [Programming in Martin-Löf's Type Theory](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.6683&rep=rep1&type=pdf)
- [Polymorphism, Subtyping, and Type Inference in MLsub](https://dl.acm.org/doi/10.1145/3093333.3009882)