# Incremental Compilation In Detail

<!-- toc -->

The incremental compilation scheme is, in essence, a surprisingly
simple extension to the overall query system. It relies on the fact that:

 1. queries are pure functions -- given the same inputs, a query will always
    yield the same result, and
 2. the query model structures compilation in an acyclic graph that makes
    dependencies between individual computations explicit.

This chapter will explain how we can use these properties for making things
incremental and then goes on to discuss various implementation issues.

## A Basic Algorithm For Incremental Query Evaluation

As explained in the [query evaluation model primer][query-model], query
invocations form a directed-acyclic graph. Here's the example from the
previous chapter again:

```ignore
  list_of_all_hir_items <----------------------------- type_check_crate()
                                                               |
                                                               |
  Hir(foo) <--- type_of(foo) <--- type_check_item(foo) <-------+
                                      |                        |
                    +-----------------+                        |
                    |                                          |
                    v                                          |
  Hir(bar) <--- type_of(bar) <--- type_check_item(bar) <-------+
```

Since every access from one query to another has to go through the query
context, we can record these accesses and thus actually build this dependency
graph in memory. With dependency tracking enabled, when compilation is done,
we know which queries were invoked (the nodes of the graph) and, for each
invocation, which other queries or inputs have gone into computing the query's
result (the edges of the graph).

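To make this concrete, here is a small, self-contained sketch of the idea (the
`QueryContext` and `DepGraph` types below are illustrative toys, not the
compiler's real data structures): the context keeps a stack of currently
executing queries and records an edge whenever one query invokes another.

```rust
use std::collections::HashSet;

/// Toy stand-in for the compiler's dependency graph -- node IDs are just
/// strings here to keep the sketch self-contained.
#[derive(Default)]
struct DepGraph {
    /// Edges `(reader, read)`: the query `reader` used the result of `read`.
    edges: HashSet<(String, String)>,
}

#[derive(Default)]
struct QueryContext {
    dep_graph: DepGraph,
    /// Stack of queries that are currently executing.
    active: Vec<String>,
}

impl QueryContext {
    /// Run `compute` as the query `key`, recording an edge from whichever
    /// query is currently active (if any) to `key`.
    fn with_query<R>(&mut self, key: &str, compute: impl FnOnce(&mut Self) -> R) -> R {
        let reader = self.active.last().cloned();
        if let Some(reader) = reader {
            self.dep_graph.edges.insert((reader, key.to_string()));
        }
        self.active.push(key.to_string());
        let result = compute(self);
        self.active.pop();
        result
    }
}

fn main() {
    let mut tcx = QueryContext::default();
    // `type_check_item(foo)` reads `type_of(foo)`, which in turn reads `Hir(foo)`.
    tcx.with_query("type_check_item(foo)", |tcx| {
        tcx.with_query("type_of(foo)", |tcx| {
            tcx.with_query("Hir(foo)", |_| ());
        });
    });
    for (reader, read) in &tcx.dep_graph.edges {
        println!("{} reads {}", reader, read);
    }
}
```
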
Now suppose we change the source code of our program so that the HIR of `bar`
looks different than before. Our goal is to only recompute those queries that
are actually affected by the change while re-using the cached results of all
the other queries. Given the dependency graph we can do exactly that. For a
given query invocation, the graph tells us exactly what data has gone into
computing its result; we just have to follow the edges until we reach
something that has changed. If we don't encounter anything that has changed,
we know that the query would still evaluate to the same result we already have
in our cache.

Taking the `type_of(foo)` invocation from above as an example, we can check
whether the cached result is still valid by following the edges to its
inputs. The only edge leads to `Hir(foo)`, an input that has not been affected
by the change. So we know that the cached result for `type_of(foo)` is still
valid.

The story is a bit different for `type_check_item(foo)`: We again walk the
edges and already know that `type_of(foo)` is fine. Then we get to
`type_of(bar)` which we have not checked yet, so we walk the edges of
`type_of(bar)` and encounter `Hir(bar)` which *has* changed. Consequently
`type_of(bar)` might yield a different result than what we have in the cache
and, transitively, the result of `type_check_item(foo)` might have changed
too. We thus re-run `type_check_item(foo)`, which in turn will re-run
`type_of(bar)`, which will yield an up-to-date result because it reads the
up-to-date version of `Hir(bar)`.


## The Problem With The Basic Algorithm: False Positives

If you read the previous paragraph carefully, you'll notice that it says that
`type_of(bar)` *might* have changed because one of its inputs has changed.
There's also the possibility that it might still yield exactly the same
result *even though* its input has changed. Consider an example with a
simple query that just computes the sign of an integer:

```ignore
  IntValue(x) <---- sign_of(x) <--- some_other_query(x)
```

Let's say that `IntValue(x)` starts out as `1000` and then is set to `2000`.
Even though `IntValue(x)` is different in the two cases, `sign_of(x)` yields
the result `+` in both cases.

If we follow the basic algorithm, however, `some_other_query(x)` would have to
(unnecessarily) be re-evaluated because it transitively depends on a changed
input. Change detection yields a "false positive" in this case because it has
to conservatively assume that `some_other_query(x)` might be affected by that
changed input.

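To make the false positive concrete, here is a tiny self-contained example
(the `sign_of` function is just an illustrative stand-in for the query above):
the input changes from `1000` to `2000`, but the result of the query does not,
so re-running `some_other_query(x)` would have been wasted work.

```rust
/// Illustrative stand-in for the `sign_of(x)` query from the diagram above.
fn sign_of(int_value: i64) -> char {
    if int_value >= 0 { '+' } else { '-' }
}

fn main() {
    // The input changes between the two compilation sessions...
    let old_input = 1000;
    let new_input = 2000;

    // ...but the query result is identical, so anything that only depends on
    // `sign_of(x)` did not actually need to be re-evaluated.
    assert_eq!(sign_of(old_input), sign_of(new_input));
    println!("sign is still {}", sign_of(new_input));
}
```
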
Unfortunately, it turns out that the actual queries in the compiler are full
of examples like this, and small changes to the input often potentially affect
very large parts of the output binaries. As a consequence, we had to make the
change detection system smarter and more accurate.

## Improving Accuracy: The red-green Algorithm

The "false positives" problem can be solved by interleaving change detection
and query re-evaluation. Instead of walking the graph all the way to the
inputs when trying to find out if some cached result is still valid, we can
check if a result has *actually* changed after we were forced to re-evaluate
it.

We call this algorithm the red-green algorithm because nodes
in the dependency graph are assigned the color green if we were able to prove
that their cached result is still valid and the color red if the result has
turned out to be different after re-evaluating it.

The meat of red-green change tracking is implemented in the try-mark-green
algorithm, which, as you might have guessed, tries to mark a given node as
green:

```rust,ignore
fn try_mark_green(tcx, current_node) -> bool {

    // Fetch the inputs to `current_node`, i.e. get the nodes that the direct
    // edges from `node` lead to.
    let dependencies = tcx.dep_graph.get_dependencies_of(current_node);

    // Now check all the inputs for changes
    for dependency in dependencies {

        match tcx.dep_graph.get_node_color(dependency) {
            Green => {
                // This input has already been checked before and it has not
                // changed; so we can go on to check the next one
            }
            Red => {
                // We found an input that has changed. We cannot mark
                // `current_node` as green without re-running the
                // corresponding query.
                return false
            }
            Unknown => {
                // This is the first time we look at this node. Let's try
                // to mark it green by calling try_mark_green() recursively.
                if try_mark_green(tcx, dependency) {
                    // We successfully marked the input as green, on to the
                    // next.
                } else {
                    // We could *not* mark the input as green. This means we
                    // don't know if its value has changed. In order to find
                    // out, we re-run the corresponding query now!
                    tcx.run_query_for(dependency);

                    // Fetch and check the node color again. Running the query
                    // has forced it to either red (if it yielded a different
                    // result than we have in the cache) or green (if it
                    // yielded the same result).
                    match tcx.dep_graph.get_node_color(dependency) {
                        Red => {
                            // The input turned out to be red, so we cannot
                            // mark `current_node` as green.
                            return false
                        }
                        Green => {
                            // Re-running the query paid off! The result is the
                            // same as before, so this particular input does
                            // not invalidate `current_node`.
                        }
                        Unknown => {
                            // There is no way a node has no color after
                            // re-running the query.
                            panic!("unreachable")
                        }
                    }
                }
            }
        }
    }

    // If we have gotten through the entire loop, it means that all inputs
    // have turned out to be green. If all inputs are unchanged, it means
    // that the query result corresponding to `current_node` cannot have
    // changed either.
    tcx.dep_graph.mark_green(current_node);

    true
}

// Note: The actual implementation can be found in
// compiler/rustc_middle/src/dep_graph/graph.rs
```

By using red-green marking we can avoid the devastating cumulative effect of
having false positives during change detection. Whenever a query is executed
in incremental mode, we first check if it is already green. If not, we run
`try_mark_green()` on it. If it still isn't green after that, then we actually
invoke the query provider to re-compute the result.

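Putting the pieces together, the overall decision procedure for a single query
invocation in incremental mode might be sketched like this (pseudocode in the
same style as above; `try_load_from_disk_or_cache` and the other helpers are
illustrative, not the compiler's actual function names):

```rust,ignore
fn get_query_result(tcx, query_node) -> QueryResult {
    if tcx.dep_graph.get_node_color(query_node) == Green
        || try_mark_green(tcx, query_node)
    {
        // The node is green: the cached result from the previous session is
        // still valid, so we can reuse it instead of running the provider.
        tcx.try_load_from_disk_or_cache(query_node)
    } else {
        // The node is red (or unknown because it has no counterpart in the
        // previous graph): actually invoke the query provider.
        tcx.run_query_for(query_node)
    }
}
```
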
## The Real World: How Persistence Makes Everything Complicated

The sections above described the underlying algorithm for incremental
compilation, but because the compiler process exits when it is done, taking
the query context and its result cache with it into oblivion, we have to
persist data to disk so the next compilation session can make use of it.
This comes with a whole new set of implementation challenges:

- The query result cache is stored to disk, so its contents are not readily
  available for change comparison.
- A subsequent compilation session will start off with a new version of the
  code that has arbitrary changes applied to it. All kinds of IDs and indices
  that are generated from a global, sequential counter (e.g. `NodeId`, `DefId`,
  etc.) might have shifted, making the persisted results on disk not
  immediately usable anymore because the same numeric IDs and indices might
  refer to completely new things in the new compilation session.
- Persisting things to disk comes at a cost, so not every tiny piece of
  information should be actually cached in between compilation sessions.
  Fixed-size, plain-old-data is preferred to complex things that need to run
  through an expensive (de-)serialization step.

The following sections describe how the compiler solves these issues.

### A Question Of Stability: Bridging The Gap Between Compilation Sessions

As noted before, various IDs (like `DefId`) are generated by the compiler in a
way that depends on the contents of the source code being compiled. ID
assignment is usually deterministic, that is, if the exact same code is
compiled twice, the same things will end up with the same IDs. However, if
something changes, e.g. a function is added in the middle of a file, there is
no guarantee that anything will have the same ID as it had before.

As a consequence, we cannot represent the data in our on-disk cache the same
way it is represented in memory. For example, if we just stored a piece
of type information like `TyKind::FnDef(DefId, &'tcx Substs<'tcx>)` (as we do
in memory) and the contained `DefId` pointed to a different function in
a new compilation session, we'd be in trouble.

The solution to this problem is to find "stable" forms for IDs which remain
valid in between compilation sessions. For the most important case, `DefId`s,
these are the so-called `DefPath`s. Each `DefId` has a
corresponding `DefPath` but in place of a numeric ID, a `DefPath` is based on
the path to the identified item, e.g. `std::collections::HashMap`. The
advantage of an ID like this is that it is not affected by unrelated changes.
For example, one can add a new function to `std::collections` but
`std::collections::HashMap` would still be `std::collections::HashMap`. A
`DefPath` is "stable" across changes made to the source code while a `DefId`
isn't.

There is also the `DefPathHash` which is just a 128-bit hash value of the
`DefPath`. The two contain the same information and we mostly use the
`DefPathHash` because it is simpler to handle, being `Copy` and self-contained.

This principle of stable identifiers is used to make the data in the on-disk
cache resilient to source code changes. Instead of storing a `DefId`, we store
the `DefPathHash` and when we deserialize something from the cache, we map the
`DefPathHash` to the corresponding `DefId` in the *current* compilation session
(which is just a simple hash table lookup).

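A minimal sketch of that round trip, with hypothetical `encode`/`decode`
helpers standing in for the compiler's actual cache encoder and decoder:

```rust,ignore
// Hypothetical sketch -- not the compiler's real encoder/decoder API.

fn encode_def_id(tcx, def_id: DefId, encoder: &mut CacheEncoder) {
    // Never write the raw, session-local `DefId`; write its stable form.
    let def_path_hash = tcx.def_path_hash(def_id);
    encoder.emit_def_path_hash(def_path_hash);
}

fn decode_def_id(tcx, decoder: &mut CacheDecoder) -> DefId {
    let def_path_hash = decoder.read_def_path_hash();
    // Map the stable hash back to whatever `DefId` the *current* session
    // assigned to that item -- conceptually just a hash table lookup.
    tcx.def_path_hash_to_def_id(def_path_hash)
}
```
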
The `HirId`, used for identifying HIR components that don't have their own
`DefId`, is another such stable ID. It is (conceptually) a pair of a `DefPath`
and a `LocalId`, where the `LocalId` identifies something (e.g. a `hir::Expr`)
locally within its "owner" (e.g. a `hir::Item`). If the owner is moved around,
the `LocalId`s within it are still the same.

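Conceptually, this two-level addressing scheme could be pictured as follows (a
simplified sketch; the field types in the actual `rustc_hir` definition differ
in detail):

```rust,ignore
// Simplified sketch of the idea behind `HirId` -- not the exact definition
// found in `rustc_hir`.
struct HirId {
    /// The item-like "owner", identified in a stable way (via its `DefPath`).
    owner: OwnerId,
    /// Position of the node *within* that owner. Moving the owner around in
    /// the file does not change the local IDs inside it.
    local_id: ItemLocalId,
}
```
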
### Checking Query Results For Changes: HashStable And Fingerprints

In order to do red-green-marking we often need to check if the result of a
query has changed compared to the result it had during the previous
compilation session. There are two performance problems with this though:

- We'd like to avoid having to load the previous result from disk just for
  doing the comparison. We already computed the new result and will use that.
  Also loading a result from disk will "pollute" the interners with data that
  is unlikely to ever be used.
- We don't want to store each and every result in the on-disk cache. For
  example, it would be wasted effort to persist things to disk that are
  already available in upstream crates.

The compiler avoids these problems by using so-called `Fingerprint`s. Each time
a new query result is computed, the query engine will compute a 128-bit hash
value of the result. We call this hash value "the `Fingerprint` of the query
result". The hashing is (and has to be) done "in a stable way". This means
that whenever something is hashed that might change in between compilation
sessions (e.g. a `DefId`), we instead hash its stable equivalent
(e.g. the corresponding `DefPath`). That's what the whole `HashStable`
infrastructure is for. This way `Fingerprint`s computed in two
different compilation sessions are still comparable.

The next step is to store these fingerprints along with the dependency graph.
This is cheap since fingerprints are just bytes to be copied. It's also cheap to
load the entire set of fingerprints together with the dependency graph.

Now, when red-green-marking reaches the point where it needs to check if a
result has changed, it can just compare the (already loaded) previous
fingerprint to the fingerprint of the new result.

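A simplified sketch of how this looks (modeled on, but not identical to, the
real `HashStable` trait, which is generic over a hashing context):

```rust,ignore
// Simplified sketch of the stable-hashing idea -- the real `HashStable`
// trait lives in `rustc_data_structures` and is more general.

impl HashStable for DefId {
    fn hash_stable(&self, hcx: &mut StableHashingContext<'_>, hasher: &mut StableHasher) {
        // Never feed the session-local numeric `DefId` into the hasher;
        // hash its stable `DefPathHash` instead.
        hcx.def_path_hash(*self).hash_stable(hcx, hasher);
    }
}

// With stable hashing in place, red-green change detection boils down to
// comparing two fingerprints:
fn result_has_changed(prev_fingerprint: Fingerprint, new_fingerprint: Fingerprint) -> bool {
    prev_fingerprint != new_fingerprint
}
```
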
This approach works rather well but it's not without flaws:

- There is a small possibility of hash collisions. That is, two different
  results could have the same fingerprint and the system would erroneously
  assume that the result hasn't changed, leading to a missed update.

  We mitigate this risk by using a high-quality hash function and a 128-bit
  wide hash value. Due to these measures the practical risk of a hash
  collision is negligible.

- Computing fingerprints is quite costly. It is the main reason why incremental
  compilation can be slower than non-incremental compilation. We are forced to
  use a good and thus expensive hash function, and we have to map things to
  their stable equivalents while doing the hashing.

### A Tale Of Two DepGraphs: The Old And The New

The initial description of dependency tracking glosses over a few details
that quickly become a head scratcher when actually trying to implement things.
In particular it's easy to overlook that we are actually dealing with *two*
dependency graphs: The one we built during the previous compilation session and
the one that we are building for the current compilation session.

When a compilation session starts, the compiler loads the previous dependency
graph into memory as an immutable piece of data. Then, when a query is invoked,
it will first try to mark the corresponding node in the graph as green. What
this really means is that we try to mark the node in the *previous* dep-graph
that corresponds to the query key in the *current* session as green. How do we
do this mapping between current query key and previous `DepNode`? The answer
is again `Fingerprint`s: Nodes in the dependency graph are identified by a
fingerprint of the query key. Since fingerprints are stable across compilation
sessions, computing one in the current session allows us to find a node
in the dependency graph from the previous session. If we don't find a node with
the given fingerprint, it means that the query key refers to something that
did not yet exist in the previous session.

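In pseudocode (same illustrative style as before; `fingerprint_of`,
`prev_graph`, and `PrevDepNodeIndex` are stand-ins, not the compiler's real
names), the mapping looks roughly like this:

```rust,ignore
fn find_node_in_previous_graph(tcx, query_key) -> Option<PrevDepNodeIndex> {
    // The fingerprint of the query key is stable across sessions, so it can
    // serve as the lookup key into the *previous* session's graph.
    let fingerprint = fingerprint_of(tcx, query_key);

    // `None` means the query key refers to something that did not yet exist
    // in the previous session, so there is nothing to mark green.
    tcx.prev_graph.node_by_fingerprint(fingerprint)
}
```
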
So, having found the dep-node in the previous dependency graph, we can look
up its dependencies (i.e. also dep-nodes in the previous graph) and continue with
the rest of the try-mark-green algorithm. The next interesting thing happens
when we successfully mark the node as green. At that point we copy the node
and the edges to its dependencies from the old graph into the new graph. We
have to do this because the new dep-graph cannot acquire the
node and edges via the regular dependency tracking. The tracking system can
only record edges while actually running a query -- but running the query,
although we have the result already cached, is exactly what we want to avoid.

Once the compilation session has finished, all the unchanged parts have been
copied over from the old into the new dependency graph, while the changed parts
have been added to the new graph by the tracking system. At this point, the
new graph is serialized out to disk, alongside the query result cache, and can
act as the previous dep-graph in a subsequent compilation session.

### Didn't You Forget Something?: Cache Promotion

The system described so far has a somewhat subtle property: If all inputs of a
dep-node are green then the dep-node itself can be marked as green without
computing or loading the corresponding query result. Applying this property
transitively often leads to the situation that some intermediate results are
never actually loaded from disk, as in the following example:

```ignore
   input(A) <-- intermediate_query(B) <-- leaf_query(C)
```

The compiler might need the value of `leaf_query(C)` in order to generate some
output artifact. If it can mark `leaf_query(C)` as green, it will load the
result from the on-disk cache. The result of `intermediate_query(B)` is never
loaded though. As a consequence, when the compiler persists the *new* result
cache by writing all in-memory query results to disk, `intermediate_query(B)`
will not be in memory and thus will be missing from the new result cache.

If there subsequently is another compilation session that actually needs the
result of `intermediate_query(B)` it will have to be re-computed even though we
had a perfectly valid result for it in the cache just before.

In order to prevent this from happening, the compiler does something called
"cache promotion": Before emitting the new result cache it will walk all green
dep-nodes and make sure that their query result is loaded into memory. That way
the result cache doesn't unnecessarily shrink again.

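Sketched in the same illustrative pseudocode (the helper names below are made
up for the example), cache promotion is just a pass over the green nodes
before the new cache is written out:

```rust,ignore
fn promote_cached_results(tcx) {
    // Runs just before the new result cache is serialized to disk.
    for dep_node in tcx.dep_graph.green_nodes() {
        if dep_node.is_cacheable() && !tcx.query_result_is_in_memory(dep_node) {
            // Load the still-valid result from the *old* cache so that it is
            // in memory and gets written into the *new* cache as well.
            tcx.load_query_result_from_disk(dep_node);
        }
    }
}
```
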
# Incremental Compilation and the Compiler Backend

The compiler backend, the part involving LLVM, uses the query system but is
not itself implemented in terms of queries. As a consequence it does not
automatically partake in dependency tracking. However, the manual integration
with the tracking system is pretty straightforward. The compiler simply tracks
what queries get invoked when generating the initial LLVM version of each
codegen unit, which results in a dep-node for each of them. In subsequent
compilation sessions it then tries to mark the dep-node for a CGU as green. If
it succeeds, it knows that the corresponding object and bitcode files on disk
are still valid. If it doesn't succeed, the entire codegen unit has to be
recompiled.

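A rough sketch of that per-CGU decision (the function and helper names here
are illustrative; the real logic is spread across the codegen backend rather
than living in a single function like this):

```rust,ignore
fn codegen_unit_is_reusable(tcx, cgu: &CodegenUnit) -> bool {
    // Each CGU got a dep-node recording all queries that went into its
    // initial LLVM module in the previous session.
    let cgu_dep_node = dep_node_for_cgu(cgu);

    // If that node can be marked green, the object and bitcode files produced
    // for this CGU last time are still valid and can be reused instead of
    // re-running LLVM.
    try_mark_green(tcx, cgu_dep_node)
}
```
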
This is the same approach that is used for regular queries. The main differences
are:

 - that we cannot easily compute a fingerprint for LLVM modules (because
   they are opaque C++ objects),

 - that the logic for dealing with cached values is rather different from
   regular queries because here we have bitcode and object files instead of
   serialized Rust values in the common result cache file, and

 - that the operations around LLVM are so expensive in terms of computation
   time and memory consumption that we need to have tight control over what is
   executed when and what stays in memory for how long.

The query system could probably be extended with general purpose mechanisms to
deal with all of the above but so far that seemed like more trouble than it
would save.


## Query Modifiers

The query system allows for applying [modifiers][mod] to queries. These
modifiers affect certain aspects of how the system treats the query with
respect to incremental compilation (a sketch of how they are attached to a
query definition follows the list):

 - `eval_always` - A query with the `eval_always` attribute is re-executed
   unconditionally during incremental compilation. I.e. the system will not
   even try to mark the query's dep-node as green. This attribute has two use
   cases:

    - `eval_always` queries can read inputs (from files, global state, etc).
      They can also produce side effects like writing to files and changing
      global state.

    - Some queries are very likely to be re-evaluated because their result
      depends on the entire source code. In this case `eval_always` can be used
      as an optimization because the system can skip recording dependencies in
      the first place.

 - `no_hash` - Applying `no_hash` to a query tells the system to not compute
   the fingerprint of the query's result. This has two consequences:

    - Not computing the fingerprint can save quite a bit of time because
      fingerprinting is expensive, especially for large, complex values.

    - Without the fingerprint, the system has to unconditionally assume that
      the result of the query has changed. As a consequence anything depending
      on a `no_hash` query will always be re-executed.

   Using `no_hash` for a query can make sense in two circumstances:

    - If the result of the query is very likely to change whenever one of its
      inputs changes, e.g. a function like `|a, b, c| -> (a * b * c)`. In such
      a case recomputing the query will always yield a red node if one of the
      inputs is red so we can spare us the trouble and default to red
      immediately. A counter example would be a function like `|a| -> (a == 42)`
      where the result does not change for most changes of `a`.

    - If the result of a query is a big, monolithic collection (e.g. `index_hir`)
      and there are "projection queries" reading from that collection
      (e.g. `hir_owner`). In such a case the big collection will likely fulfill
      the condition above (any changed input means recomputing the whole
      collection) and the results of the projection queries will be hashed
      anyway. If we also hashed the collection query it would mean that we
      effectively hash the same data twice: once when hashing the collection
      and another time when hashing all the projection query results. `no_hash`
      allows us to avoid that redundancy and the projection queries act as a
      "firewall", shielding their dependents from the unconditionally red
      `no_hash` node.

 - `cache_on_disk_if` - This attribute is what determines which query results
   are persisted in the incremental compilation query result cache. The
   attribute takes an expression that allows per query invocation
   decisions. For example, it makes no sense to store values from upstream
   crates in the cache because they are already available in the upstream
   crate's metadata.

 - `anon` - This attribute makes the system use "anonymous" dep-nodes for the
   given query. An anonymous dep-node is not identified by the corresponding
   query key, instead its ID is computed from the IDs of its dependencies. This
   allows the red-green system to do its change detection even if there is no
   query key available for a given dep-node -- something which is needed for
   handling trait selection because it is not based on queries.

[mod]: ../query.html#adding-a-new-kind-of-query

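As a sketch of how these modifiers are attached to queries, consider the
following hypothetical entries, loosely modeled on the compiler's query
definition macro (the query names, result types, and conditions are made up
for illustration; the exact syntax is documented in the query chapter linked
above):

```rust,ignore
// Hypothetical query definitions showing where modifiers go.
rustc_queries! {
    query my_monolithic_map(_: ()) -> &'tcx MyBigMap {
        // Re-run unconditionally and skip dependency recording...
        eval_always
        // ...and don't fingerprint the huge result; rely on projection
        // queries downstream to stop the "red" from spreading.
        no_hash
        desc { "building the big monolithic map" }
    }

    query my_projection(key: DefId) -> &'tcx MySmallValue {
        // Only persist results for items from the local crate; upstream
        // results are already available in crate metadata.
        cache_on_disk_if { key.is_local() }
        desc { |tcx| "projecting a value for `{}`", tcx.def_path_str(key) }
    }
}
```
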
## The Projection Query Pattern

It's interesting to note that `eval_always` and `no_hash` can be used together
in the so-called "projection query" pattern. It is often the case that there is
one query that depends on the entirety of the compiler's input (e.g. the indexed HIR)
and another query that projects individual values out of this monolithic value
(e.g. a HIR item with a certain `DefId`). These projection queries allow for
building change propagation "firewalls" because even if the result of the
monolithic query changes (which it is very likely to do) the small projections
can still mostly be marked as green.

```ignore
  +------------+
  |            |           +---------------+           +--------+
  |            | <---------| projection(x) | <---------| foo(a) |
  |            |           +---------------+           +--------+
  |            |
  | monolithic |           +---------------+           +--------+
  |   query    | <---------| projection(y) | <---------| bar(b) |
  |            |           +---------------+           +--------+
  |            |
  |            |           +---------------+           +--------+
  |            | <---------| projection(z) | <---------| baz(c) |
  |            |           +---------------+           +--------+
  +------------+
```

Let's assume that the result of `monolithic_query` changes in a way that also
changes the result of `projection(x)`, i.e. both their dep-nodes are marked as
red. As a consequence `foo(a)` needs to be re-executed; but `bar(b)` and
`baz(c)` can be marked as green. However, if `foo`, `bar`, and `baz` had
directly depended on `monolithic_query`, then all of them would have had to be
re-evaluated.

This pattern works even without `eval_always` and `no_hash` but the two
modifiers can be used to avoid unnecessary overhead. If the monolithic query
is likely to change at any minor modification of the compiler's input, it makes
sense to mark it as `eval_always`, thus getting rid of its dependency tracking
cost. And it always makes sense to mark the monolithic query as `no_hash`
because we have the projections to take care of keeping things green as much
as possible.

# Shortcomings of the Current System

There are many things that still can be improved.

## Incrementality of on-disk data structures

The current system is not able to update on-disk caches and the dependency graph
in-place. Instead it has to rewrite each file entirely in each compilation
session. The overhead of doing so is a few percent of total compilation time.

## Unnecessary data dependencies

Data structures used as query results could be factored in a way that removes
edges from the dependency graph. Especially "span" information is very volatile,
so including it in a query result will increase the chance that that result
won't be reusable. See <https://github.com/rust-lang/rust/issues/47389> for more
information.


[query-model]: ./query-evaluation-model-in-detail.html