[rustc.git] / src / doc / rustc-dev-guide / src / queries / incremental-compilation.md

# Incremental compilation

The incremental compilation scheme is, in essence, a surprisingly
simple extension to the overall query system. We'll start by describing
a slightly simplified variant of the real thing – the "basic algorithm" –
and then describe some possible improvements.

## The basic algorithm

The basic algorithm is
called the **red-green** algorithm[^salsa]. The high-level idea is
that, after each run of the compiler, we will save the results of all
the queries that we do, as well as the **query DAG**. The
**query DAG** is a [DAG] that indexes which queries executed which
other queries. So, for example, there would be an edge from a query Q1
to another query Q2 if computing Q1 required computing Q2 (note that
because queries cannot depend on themselves, this results in a DAG and
not a general graph).

[DAG]: https://en.wikipedia.org/wiki/Directed_acyclic_graph

On the next run of the compiler, then, we can sometimes reuse these
query results to avoid re-executing a query. We do this by assigning
every query a **color**:

- If a query is colored **red**, that means that its result during
  this compilation has **changed** from the previous compilation.
- If a query is colored **green**, that means that its result is
  the **same** as the previous compilation.

There are two key insights here:

- First, if all the inputs to query Q are colored green, then the
  query Q **must** result in the same value as last time and hence
  need not be re-executed (or else the compiler is not deterministic).
- Second, even if some inputs to a query changes, it may be that it
  **still** produces the same result as the previous compilation. In
  particular, the query may only use part of its input.
  - Therefore, after executing a query, we always check whether it
    produced the same result as the previous time. **If it did,** we
    can still mark the query as green, and hence avoid re-executing
    dependent queries.

### The try-mark-green algorithm

At the core of incremental compilation is an algorithm called
"try-mark-green". It has the job of determining the color of a given
query Q (which must not have yet been executed). In cases where Q has
red inputs, determining Q's color may involve re-executing Q so that
we can compare its output, but if all of Q's inputs are green, then we
can conclude that Q must be green without re-executing it or inspecting
its value at all. In the compiler, this allows us to avoid
deserializing the result from disk when we don't need it, and in fact
enables us to sometimes skip *serializing* the result as well
(see the refinements section below).

Try-mark-green works as follows:

- First check if the query Q was executed during the previous compilation.
  - If not, we can just re-execute the query as normal, and assign it the
    color of red.
- If yes, then load the 'dependent queries' of Q.
- If there is a saved result, then we load the `reads(Q)` vector from the
  query DAG. The "reads" is the set of queries that Q executed during
  its execution.
  - For each query R in `reads(Q)`, we recursively demand the color
    of R using try-mark-green.
    - Note: it is important that we visit each node in `reads(Q)` in same order
      as they occurred in the original compilation. See [the section on the
      query DAG below](#dag).
    - If **any** of the nodes in `reads(Q)` wind up colored **red**, then Q is
      dirty.
      - We re-execute Q and compare the hash of its result to the hash of the
        result from the previous compilation.
      - If the hash has not changed, we can mark Q as **green** and return.
    - Otherwise, **all** of the nodes in `reads(Q)` must be **green**. In that
      case, we can color Q as **green** and return.

<a name="dag"></a>

### The query DAG

The query DAG code is stored in
[`src/librustc_middle/dep_graph`][dep_graph]. Construction of the DAG is done
by instrumenting the query execution.

One key point is that the query DAG also tracks ordering; that is, for
each query Q, we not only track the queries that Q reads, we track the
**order** in which they were read.  This allows try-mark-green to walk
those queries back in the same order. This is important because once a
subquery comes back as red, we can no longer be sure that Q will continue
along the same path as before. That is, imagine a query like this:

```rust,ignore
fn main_query(tcx) {
    if tcx.subquery1() {
        tcx.subquery2()
    } else {
        tcx.subquery3()
    }
}
```

Now imagine that in the first compilation, `main_query` starts by
executing `subquery1`, and this returns true. In that case, the next
query `main_query` executes will be `subquery2`, and `subquery3` will
not be executed at all.

But now imagine that in the **next** compilation, the input has
changed such that `subquery1` returns **false**. In this case, `subquery2`
would never execute. If try-mark-green were to visit `reads(main_query)` out
of order, however, it might visit `subquery2` before `subquery1`, and hence
execute it.
This can lead to ICEs and other problems in the compiler.

[dep_graph]: https://github.com/rust-lang/rust/tree/master/src/librustc_middle/dep_graph

## Improvements to the basic algorithm

In the description of the basic algorithm, we said that at the end of
compilation we would save the results of all the queries that were
performed.  In practice, this can be quite wasteful – many of those
results are very cheap to recompute, and serializing and deserializing
them is not a particular win. In practice, what we would do is to save
**the hashes** of all the subqueries that we performed. Then, in select cases,
we **also** save the results.

This is why the incremental algorithm separates computing the
**color** of a node, which often does not require its value, from
computing the **result** of a node. Computing the result is done via a simple
algorithm like so:

- Check if a saved result for Q is available. If so, compute the color of Q.
  If Q is green, deserialize and return the saved result.
- Otherwise, execute Q.
  - We can then compare the hash of the result and color Q as green if
    it did not change.

## Resources
The initial design document can be found at https://github.com/nikomatsakis/rustc-on-demand-incremental-design-doc/blob/master/0000-rustc-on-demand-and-incremental.md, which expands on the memoization details, provides more high-level overview and motivation for this system.

# Footnotes

[^salsa]: I have long wanted to rename it to the Salsa algorithm, but it never caught on. -@nikomatsakis
Commit	Line	Data
a1dfa0c6 XL	1	# Incremental compilation
	2
	3	The incremental compilation scheme is, in essence, a surprisingly
	4	simple extension to the overall query system. We'll start by describing
	5	a slightly simplified variant of the real thing – the "basic algorithm" –
	6	and then describe some possible improvements.
	7
	8	## The basic algorithm
	9
	10	The basic algorithm is
	11	called the red-green algorithm[^salsa]. The high-level idea is
	12	that, after each run of the compiler, we will save the results of all
	13	the queries that we do, as well as the query DAG. The
	14	query DAG is a [DAG] that indexes which queries executed which
	15	other queries. So, for example, there would be an edge from a query Q1
	16	to another query Q2 if computing Q1 required computing Q2 (note that
	17	because queries cannot depend on themselves, this results in a DAG and
	18	not a general graph).
	19
	20	[DAG]: https://en.wikipedia.org/wiki/Directed_acyclic_graph
	21
	22	On the next run of the compiler, then, we can sometimes reuse these
	23	query results to avoid re-executing a query. We do this by assigning
	24	every query a color:
	25
	26	- If a query is colored red, that means that its result during
	27	this compilation has changed from the previous compilation.
	28	- If a query is colored green, that means that its result is
	29	the same as the previous compilation.
	30
	31	There are two key insights here:
	32
	33	- First, if all the inputs to query Q are colored green, then the
	34	query Q must result in the same value as last time and hence
	35	need not be re-executed (or else the compiler is not deterministic).
	36	- Second, even if some inputs to a query changes, it may be that it
	37	still produces the same result as the previous compilation. In
	38	particular, the query may only use part of its input.
	39	- Therefore, after executing a query, we always check whether it
	40	produced the same result as the previous time. If it did, we
	41	can still mark the query as green, and hence avoid re-executing
	42	dependent queries.
	43
	44	### The try-mark-green algorithm
	45
	46	At the core of incremental compilation is an algorithm called
	47	"try-mark-green". It has the job of determining the color of a given
	48	query Q (which must not have yet been executed). In cases where Q has
	49	red inputs, determining Q's color may involve re-executing Q so that
	50	we can compare its output, but if all of Q's inputs are green, then we
	51	can conclude that Q must be green without re-executing it or inspecting
	52	its value at all. In the compiler, this allows us to avoid
	53	deserializing the result from disk when we don't need it, and in fact
	54	enables us to sometimes skip serializing the result as well
	55	(see the refinements section below).
	56
	57	Try-mark-green works as follows:
	58
	59	- First check if the query Q was executed during the previous compilation.
	60	- If not, we can just re-execute the query as normal, and assign it the
	61	color of red.
	62	- If yes, then load the 'dependent queries' of Q.
	63	- If there is a saved result, then we load the `reads(Q)` vector from the
	64	query DAG. The "reads" is the set of queries that Q executed during
65	its execution.
66	- For each query R in `reads(Q)`, we recursively demand the color
67	of R using try-mark-green.
68	- Note: it is important that we visit each node in `reads(Q)` in same order
69	as they occurred in the original compilation. See [the section on the
70	query DAG below](#dag).
71	- If any of the nodes in `reads(Q)` wind up colored red, then Q is
72	dirty.
73	- We re-execute Q and compare the hash of its result to the hash of the
74	result from the previous compilation.
75	- If the hash has not changed, we can mark Q as green and return.
76	- Otherwise, all of the nodes in `reads(Q)` must be green. In that
77	case, we can color Q as green and return.
78
79	<a name="dag"></a>
80
81	### The query DAG
82
83	The query DAG code is stored in
ba9703b0	84	[`src/librustc_middle/dep_graph`][dep_graph]. Construction of the DAG is done
a1dfa0c6 XL	85	by instrumenting the query execution.
	86
	87	One key point is that the query DAG also tracks ordering; that is, for
	88	each query Q, we not only track the queries that Q reads, we track the
	89	order in which they were read. This allows try-mark-green to walk
	90	those queries back in the same order. This is important because once a
	91	subquery comes back as red, we can no longer be sure that Q will continue
	92	along the same path as before. That is, imagine a query like this:
	93
	94	```rust,ignore
	95	fn main_query(tcx) {
	96	if tcx.subquery1() {
	97	tcx.subquery2()
	98	} else {
	99	tcx.subquery3()
	100	}
	101	}
	102	```
	103
	104	Now imagine that in the first compilation, `main_query` starts by
	105	executing `subquery1`, and this returns true. In that case, the next
	106	query `main_query` executes will be `subquery2`, and `subquery3` will
	107	not be executed at all.
	108
	109	But now imagine that in the next compilation, the input has
	110	changed such that `subquery1` returns false. In this case, `subquery2`
	111	would never execute. If try-mark-green were to visit `reads(main_query)` out
	112	of order, however, it might visit `subquery2` before `subquery1`, and hence
	113	execute it.
	114	This can lead to ICEs and other problems in the compiler.
	115
ba9703b0	116	[dep_graph]: https://github.com/rust-lang/rust/tree/master/src/librustc_middle/dep_graph
a1dfa0c6 XL	117
	118	## Improvements to the basic algorithm
	119
	120	In the description of the basic algorithm, we said that at the end of
	121	compilation we would save the results of all the queries that were
	122	performed. In practice, this can be quite wasteful – many of those
	123	results are very cheap to recompute, and serializing and deserializing
	124	them is not a particular win. In practice, what we would do is to save
	125	the hashes of all the subqueries that we performed. Then, in select cases,
	126	we also save the results.
	127
	128	This is why the incremental algorithm separates computing the
	129	color of a node, which often does not require its value, from
	130	computing the result of a node. Computing the result is done via a simple
	131	algorithm like so:
	132
	133	- Check if a saved result for Q is available. If so, compute the color of Q.
	134	If Q is green, deserialize and return the saved result.
	135	- Otherwise, execute Q.
	136	- We can then compare the hash of the result and color Q as green if
	137	it did not change.
	138
532ac7d7 XL	139	## Resources
	140	The initial design document can be found at https://github.com/nikomatsakis/rustc-on-demand-incremental-design-doc/blob/master/0000-rustc-on-demand-and-incremental.md, which expands on the memoization details, provides more high-level overview and motivation for this system.
	141
a1dfa0c6 XL	142	# Footnotes
	143
	144	[^salsa]: I have long wanted to rename it to the Salsa algorithm, but it never caught on. -@nikomatsakis