[rustc.git] / src / doc / rustc-dev-guide / src / queries / incremental-compilation.md

# Incremental compilation

<!-- toc -->

The incremental compilation scheme is, in essence, a surprisingly
simple extension to the overall query system. We'll start by describing
a slightly simplified variant of the real thing – the "basic algorithm" –
and then describe some possible improvements.

## The basic algorithm

The basic algorithm is
called the **red-green** algorithm[^salsa]. The high-level idea is
that, after each run of the compiler, we will save the results of all
the queries that we do, as well as the **query DAG**. The
**query DAG** is a [DAG] that indexes which queries executed which
other queries. So, for example, there would be an [edge] from a query Q1
to another query Q2 if computing Q1 required computing Q2 (note that
because queries cannot depend on themselves, this results in a DAG and
not a general graph).

[DAG]: https://en.wikipedia.org/wiki/Directed_acyclic_graph

On the next run of the compiler, then, we can sometimes reuse these
query results to avoid re-executing a query. We do this by assigning
every query a **color**:

- If a query is colored **red**, that means that its result during
  this compilation has **changed** from the previous compilation.
- If a query is colored **green**, that means that its result is
  the **same** as the previous compilation.

There are two key insights here:

- First, if all the inputs to query Q are colored green, then the
  query Q **must** result in the same value as last time and hence
  need not be re-executed (or else the compiler is not deterministic).
- Second, even if some inputs to a query changes, it may be that it
  **still** produces the same result as the previous compilation. In
  particular, the query may only use part of its input.
  - Therefore, after executing a query, we always check whether it
    produced the same result as the previous time. **If it did,** we
    can still mark the query as green, and hence avoid re-executing
    dependent queries.

### The try-mark-green algorithm

At the core of incremental compilation is an algorithm called
"try-mark-green". It has the job of determining the color of a given
query Q (which must not have yet been executed). In cases where Q has
red inputs, determining Q's color may involve re-executing Q so that
we can compare its output, but if all of Q's inputs are green, then we
can conclude that Q must be green without re-executing it or inspecting
its value at all. In the compiler, this allows us to avoid
deserializing the result from disk when we don't need it, and in fact
enables us to sometimes skip *serializing* the result as well
(see the refinements section below).

Try-mark-green works as follows:

- First check if the query Q was executed during the previous compilation.
  - If not, we can just re-execute the query as normal, and assign it the
    color of red.
- If yes, then load the 'dependent queries' of Q.
- If there is a saved result, then we load the `reads(Q)` vector from the
  query DAG. The "reads" is the set of queries that Q executed during
  its execution.
  - For each query R in `reads(Q)`, we recursively demand the color
    of R using try-mark-green.
    - Note: it is important that we visit each node in `reads(Q)` in same order
      as they occurred in the original compilation. See [the section on the
      query DAG below](#dag).
    - If **any** of the nodes in `reads(Q)` wind up colored **red**, then Q is
      dirty.
      - We re-execute Q and compare the hash of its result to the hash of the
        result from the previous compilation.
      - If the hash has not changed, we can mark Q as **green** and return.
    - Otherwise, **all** of the nodes in `reads(Q)` must be **green**. In that
      case, we can color Q as **green** and return.

<a name="dag"></a>

### The query DAG

The query DAG code is stored in
[`compiler/rustc_middle/src/dep_graph`][dep_graph]. Construction of the DAG is done
by instrumenting the query execution.

One key point is that the query DAG also tracks ordering; that is, for
each query Q, we not only track the queries that Q reads, we track the
**order** in which they were read.  This allows try-mark-green to walk
those queries back in the same order. This is important because once a
subquery comes back as red, we can no longer be sure that Q will continue
along the same path as before. That is, imagine a query like this:

```rust,ignore
fn main_query(tcx) {
    if tcx.subquery1() {
        tcx.subquery2()
    } else {
        tcx.subquery3()
    }
}
```

Now imagine that in the first compilation, `main_query` starts by
executing `subquery1`, and this returns true. In that case, the next
query `main_query` executes will be `subquery2`, and `subquery3` will
not be executed at all.

But now imagine that in the **next** compilation, the input has
changed such that `subquery1` returns **false**. In this case, `subquery2`
would never execute. If try-mark-green were to visit `reads(main_query)` out
of order, however, it might visit `subquery2` before `subquery1`, and hence
execute it.
This can lead to ICEs and other problems in the compiler.

[dep_graph]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/dep_graph/index.html

## Improvements to the basic algorithm

In the description of the basic algorithm, we said that at the end of
compilation we would save the results of all the queries that were
performed.  In practice, this can be quite wasteful – many of those
results are very cheap to recompute, and serializing and deserializing
them is not a particular win. In practice, what we would do is to save
**the hashes** of all the subqueries that we performed. Then, in select cases,
we **also** save the results.

This is why the incremental algorithm separates computing the
**color** of a node, which often does not require its value, from
computing the **result** of a node. Computing the result is done via a simple
algorithm like so:

- Check if a saved result for Q is available. If so, compute the color of Q.
  If Q is green, deserialize and return the saved result.
- Otherwise, execute Q.
  - We can then compare the hash of the result and color Q as green if
    it did not change.

## Resources
The initial design document can be found [here][initial-design], which expands
on the memoization details, provides more high-level overview and motivation
for this system.

# Footnotes

[^salsa]: I have long wanted to rename it to the Salsa algorithm, but it never caught on. -@nikomatsakis

[edge]: https://en.wikipedia.org/wiki/Glossary_of_graph_theory_terms#edge
[initial-design]: https://github.com/nikomatsakis/rustc-on-demand-incremental-design-doc/blob/master/0000-rustc-on-demand-and-incremental.md
Commit	Line	Data
a1dfa0c6 XL	1	# Incremental compilation
a1dfa0c6 XL	2
6a06907d XL	3	<!-- toc -->
6a06907d XL	4
a1dfa0c6 XL	5	The incremental compilation scheme is, in essence, a surprisingly
	6	simple extension to the overall query system. We'll start by describing
	7	a slightly simplified variant of the real thing – the "basic algorithm" –
	8	and then describe some possible improvements.
	9
	10	## The basic algorithm
	11
	12	The basic algorithm is
	13	called the red-green algorithm[^salsa]. The high-level idea is
	14	that, after each run of the compiler, we will save the results of all
	15	the queries that we do, as well as the query DAG. The
	16	query DAG is a [DAG] that indexes which queries executed which
6a06907d	17	other queries. So, for example, there would be an [edge] from a query Q1
a1dfa0c6 XL	18	to another query Q2 if computing Q1 required computing Q2 (note that
	19	because queries cannot depend on themselves, this results in a DAG and
	20	not a general graph).
	21
	22	[DAG]: https://en.wikipedia.org/wiki/Directed_acyclic_graph
	23
	24	On the next run of the compiler, then, we can sometimes reuse these
	25	query results to avoid re-executing a query. We do this by assigning
	26	every query a color:
	27
	28	- If a query is colored red, that means that its result during
	29	this compilation has changed from the previous compilation.
	30	- If a query is colored green, that means that its result is
	31	the same as the previous compilation.
	32
	33	There are two key insights here:
	34
	35	- First, if all the inputs to query Q are colored green, then the
	36	query Q must result in the same value as last time and hence
	37	need not be re-executed (or else the compiler is not deterministic).
	38	- Second, even if some inputs to a query changes, it may be that it
	39	still produces the same result as the previous compilation. In
	40	particular, the query may only use part of its input.
	41	- Therefore, after executing a query, we always check whether it
	42	produced the same result as the previous time. If it did, we
	43	can still mark the query as green, and hence avoid re-executing
	44	dependent queries.
	45
	46	### The try-mark-green algorithm
	47
	48	At the core of incremental compilation is an algorithm called
	49	"try-mark-green". It has the job of determining the color of a given
	50	query Q (which must not have yet been executed). In cases where Q has
	51	red inputs, determining Q's color may involve re-executing Q so that
	52	we can compare its output, but if all of Q's inputs are green, then we
	53	can conclude that Q must be green without re-executing it or inspecting
	54	its value at all. In the compiler, this allows us to avoid
	55	deserializing the result from disk when we don't need it, and in fact
	56	enables us to sometimes skip serializing the result as well
	57	(see the refinements section below).
	58
	59	Try-mark-green works as follows:
	60
	61	- First check if the query Q was executed during the previous compilation.
	62	- If not, we can just re-execute the query as normal, and assign it the
	63	color of red.
	64	- If yes, then load the 'dependent queries' of Q.
	65	- If there is a saved result, then we load the `reads(Q)` vector from the
	66	query DAG. The "reads" is the set of queries that Q executed during
	67	its execution.
	68	- For each query R in `reads(Q)`, we recursively demand the color
	69	of R using try-mark-green.
	70	- Note: it is important that we visit each node in `reads(Q)` in same order
	71	as they occurred in the original compilation. See [the section on the
	72	query DAG below](#dag).
	73	- If any of the nodes in `reads(Q)` wind up colored red, then Q is
	74	dirty.
	75	- We re-execute Q and compare the hash of its result to the hash of the
	76	result from the previous compilation.
	77	- If the hash has not changed, we can mark Q as green and return.
	78	- Otherwise, all of the nodes in `reads(Q)` must be green. In that
	79	case, we can color Q as green and return.
	80
	81	<a name="dag"></a>
82
83	### The query DAG
84
85	The query DAG code is stored in
6a06907d	86	[`compiler/rustc_middle/src/dep_graph`][dep_graph]. Construction of the DAG is done
a1dfa0c6 XL	87	by instrumenting the query execution.
	88
	89	One key point is that the query DAG also tracks ordering; that is, for
	90	each query Q, we not only track the queries that Q reads, we track the
	91	order in which they were read. This allows try-mark-green to walk
	92	those queries back in the same order. This is important because once a
	93	subquery comes back as red, we can no longer be sure that Q will continue
	94	along the same path as before. That is, imagine a query like this:
	95
	96	```rust,ignore
	97	fn main_query(tcx) {
	98	if tcx.subquery1() {
	99	tcx.subquery2()
	100	} else {
	101	tcx.subquery3()
	102	}
	103	}
	104	```
	105
	106	Now imagine that in the first compilation, `main_query` starts by
	107	executing `subquery1`, and this returns true. In that case, the next
	108	query `main_query` executes will be `subquery2`, and `subquery3` will
	109	not be executed at all.
	110
	111	But now imagine that in the next compilation, the input has
	112	changed such that `subquery1` returns false. In this case, `subquery2`
	113	would never execute. If try-mark-green were to visit `reads(main_query)` out
	114	of order, however, it might visit `subquery2` before `subquery1`, and hence
	115	execute it.
	116	This can lead to ICEs and other problems in the compiler.
	117
6a06907d	118	[dep_graph]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/dep_graph/index.html
a1dfa0c6 XL	119
	120	## Improvements to the basic algorithm
	121
	122	In the description of the basic algorithm, we said that at the end of
	123	compilation we would save the results of all the queries that were
	124	performed. In practice, this can be quite wasteful – many of those
	125	results are very cheap to recompute, and serializing and deserializing
	126	them is not a particular win. In practice, what we would do is to save
	127	the hashes of all the subqueries that we performed. Then, in select cases,
	128	we also save the results.
	129
	130	This is why the incremental algorithm separates computing the
	131	color of a node, which often does not require its value, from
	132	computing the result of a node. Computing the result is done via a simple
	133	algorithm like so:
	134
	135	- Check if a saved result for Q is available. If so, compute the color of Q.
	136	If Q is green, deserialize and return the saved result.
	137	- Otherwise, execute Q.
	138	- We can then compare the hash of the result and color Q as green if
	139	it did not change.
	140
532ac7d7	141	## Resources
6a06907d XL	142	The initial design document can be found [here][initial-design], which expands
	143	on the memoization details, provides more high-level overview and motivation
	144	for this system.
532ac7d7	145
a1dfa0c6 XL	146	# Footnotes
	147
	148	[^salsa]: I have long wanted to rename it to the Salsa algorithm, but it never caught on. -@nikomatsakis
6a06907d XL	149
	150	[edge]: https://en.wikipedia.org/wiki/Glossary_of_graph_theory_terms#edge
	151	[initial-design]: https://github.com/nikomatsakis/rustc-on-demand-incremental-design-doc/blob/master/0000-rustc-on-demand-and-incremental.md