]> git.proxmox.com Git - rustc.git/blob - src/librustc/infer/higher_ranked/README.md
Imported Upstream version 1.9.0+dfsg1
[rustc.git] / src / librustc / infer / higher_ranked / README.md
1 # Skolemization and functions
2
3 One of the trickiest and most subtle aspects of regions is dealing
4 with higher-ranked things which include bound region variables, such
5 as function types. I strongly suggest that if you want to understand
6 the situation, you read this paper (which is, admittedly, very long,
7 but you don't have to read the whole thing):
8
9 http://research.microsoft.com/en-us/um/people/simonpj/papers/higher-rank/
10
11 Although my explanation will never compete with SPJ's (for one thing,
12 his is approximately 100 pages), I will attempt to explain the basic
13 problem and also how we solve it. Note that the paper only discusses
14 subtyping, not the computation of LUB/GLB.
15
16 The problem we are addressing is that there is a kind of subtyping
17 between functions with bound region parameters. Consider, for
18 example, whether the following relation holds:
19
20 for<'a> fn(&'a isize) <: for<'b> fn(&'b isize)? (Yes, a => b)
21
22 The answer is that of course it does. These two types are basically
23 the same, except that in one we used the name `a` and one we used
24 the name `b`.
25
26 In the examples that follow, it becomes very important to know whether
27 a lifetime is bound in a function type (that is, is a lifetime
28 parameter) or appears free (is defined in some outer scope).
29 Therefore, from now on I will always write the bindings explicitly,
30 using the Rust syntax `for<'a> fn(&'a isize)` to indicate that `a` is a
31 lifetime parameter.
32
33 Now let's consider two more function types. Here, we assume that the
34 `'b` lifetime is defined somewhere outside and hence is not a lifetime
35 parameter bound by the function type (it "appears free"):
36
37 for<'a> fn(&'a isize) <: fn(&'b isize)? (Yes, a => b)
38
39 This subtyping relation does in fact hold. To see why, you have to
40 consider what subtyping means. One way to look at `T1 <: T2` is to
41 say that it means that it is always ok to treat an instance of `T1` as
42 if it had the type `T2`. So, with our functions, it is always ok to
43 treat a function that can take pointers with any lifetime as if it
44 were a function that can only take a pointer with the specific
45 lifetime `'b`. After all, `'b` is a lifetime, after all, and
46 the function can take values of any lifetime.
47
48 You can also look at subtyping as the *is a* relationship. This amounts
49 to the same thing: a function that accepts pointers with any lifetime
50 *is a* function that accepts pointers with some specific lifetime.
51
52 So, what if we reverse the order of the two function types, like this:
53
54 fn(&'b isize) <: for<'a> fn(&'a isize)? (No)
55
56 Does the subtyping relationship still hold? The answer of course is
57 no. In this case, the function accepts *only the lifetime `'b`*,
58 so it is not reasonable to treat it as if it were a function that
59 accepted any lifetime.
60
61 What about these two examples:
62
63 for<'a,'b> fn(&'a isize, &'b isize) <: for<'a> fn(&'a isize, &'a isize)? (Yes)
64 for<'a> fn(&'a isize, &'a isize) <: for<'a,'b> fn(&'a isize, &'b isize)? (No)
65
66 Here, it is true that functions which take two pointers with any two
67 lifetimes can be treated as if they only accepted two pointers with
68 the same lifetime, but not the reverse.
69
70 ## The algorithm
71
72 Here is the algorithm we use to perform the subtyping check:
73
74 1. Replace all bound regions in the subtype with new variables
75 2. Replace all bound regions in the supertype with skolemized
76 equivalents. A "skolemized" region is just a new fresh region
77 name.
78 3. Check that the parameter and return types match as normal
79 4. Ensure that no skolemized regions 'leak' into region variables
80 visible from "the outside"
81
82 Let's walk through some examples and see how this algorithm plays out.
83
84 #### First example
85
86 We'll start with the first example, which was:
87
88 1. for<'a> fn(&'a T) <: for<'b> fn(&'b T)? Yes: a -> b
89
90 After steps 1 and 2 of the algorithm we will have replaced the types
91 like so:
92
93 1. fn(&'A T) <: fn(&'x T)?
94
95 Here the upper case `&A` indicates a *region variable*, that is, a
96 region whose value is being inferred by the system. I also replaced
97 `&b` with `&x`---I'll use letters late in the alphabet (`x`, `y`, `z`)
98 to indicate skolemized region names. We can assume they don't appear
99 elsewhere. Note that neither the sub- nor the supertype bind any
100 region names anymore (as indicated by the absence of `<` and `>`).
101
102 The next step is to check that the parameter types match. Because
103 parameters are contravariant, this means that we check whether:
104
105 &'x T <: &'A T
106
107 Region pointers are contravariant so this implies that
108
109 &A <= &x
110
111 must hold, where `<=` is the subregion relationship. Processing
112 *this* constrain simply adds a constraint into our graph that `&A <=
113 &x` and is considered successful (it can, for example, be satisfied by
114 choosing the value `&x` for `&A`).
115
116 So far we have encountered no error, so the subtype check succeeds.
117
118 #### The third example
119
120 Now let's look first at the third example, which was:
121
122 3. fn(&'a T) <: for<'b> fn(&'b T)? No!
123
124 After steps 1 and 2 of the algorithm we will have replaced the types
125 like so:
126
127 3. fn(&'a T) <: fn(&'x T)?
128
129 This looks pretty much the same as before, except that on the LHS
130 `'a` was not bound, and hence was left as-is and not replaced with
131 a variable. The next step is again to check that the parameter types
132 match. This will ultimately require (as before) that `'a` <= `&x`
133 must hold: but this does not hold. `self` and `x` are both distinct
134 free regions. So the subtype check fails.
135
136 #### Checking for skolemization leaks
137
138 You may be wondering about that mysterious last step in the algorithm.
139 So far it has not been relevant. The purpose of that last step is to
140 catch something like *this*:
141
142 for<'a> fn() -> fn(&'a T) <: fn() -> for<'b> fn(&'b T)? No.
143
144 Here the function types are the same but for where the binding occurs.
145 The subtype returns a function that expects a value in precisely one
146 region. The supertype returns a function that expects a value in any
147 region. If we allow an instance of the subtype to be used where the
148 supertype is expected, then, someone could call the fn and think that
149 the return value has type `fn<b>(&'b T)` when it really has type
150 `fn(&'a T)` (this is case #3, above). Bad.
151
152 So let's step through what happens when we perform this subtype check.
153 We first replace the bound regions in the subtype (the supertype has
154 no bound regions). This gives us:
155
156 fn() -> fn(&'A T) <: fn() -> for<'b> fn(&'b T)?
157
158 Now we compare the return types, which are covariant, and hence we have:
159
160 fn(&'A T) <: for<'b> fn(&'b T)?
161
162 Here we skolemize the bound region in the supertype to yield:
163
164 fn(&'A T) <: fn(&'x T)?
165
166 And then proceed to compare the argument types:
167
168 &'x T <: &'A T
169 'A <= 'x
170
171 Finally, this is where it gets interesting! This is where an error
172 *should* be reported. But in fact this will not happen. The reason why
173 is that `A` is a variable: we will infer that its value is the fresh
174 region `x` and think that everything is happy. In fact, this behavior
175 is *necessary*, it was key to the first example we walked through.
176
177 The difference between this example and the first one is that the variable
178 `A` already existed at the point where the skolemization occurred. In
179 the first example, you had two functions:
180
181 for<'a> fn(&'a T) <: for<'b> fn(&'b T)
182
183 and hence `&A` and `&x` were created "together". In general, the
184 intention of the skolemized names is that they are supposed to be
185 fresh names that could never be equal to anything from the outside.
186 But when inference comes into play, we might not be respecting this
187 rule.
188
189 So the way we solve this is to add a fourth step that examines the
190 constraints that refer to skolemized names. Basically, consider a
191 non-directed version of the constraint graph. Let `Tainted(x)` be the
192 set of all things reachable from a skolemized variable `x`.
193 `Tainted(x)` should not contain any regions that existed before the
194 step at which the skolemization was performed. So this case here
195 would fail because `&x` was created alone, but is relatable to `&A`.
196
197 ## Computing the LUB and GLB
198
199 The paper I pointed you at is written for Haskell. It does not
200 therefore considering subtyping and in particular does not consider
201 LUB or GLB computation. We have to consider this. Here is the
202 algorithm I implemented.
203
204 First though, let's discuss what we are trying to compute in more
205 detail. The LUB is basically the "common supertype" and the GLB is
206 "common subtype"; one catch is that the LUB should be the
207 *most-specific* common supertype and the GLB should be *most general*
208 common subtype (as opposed to any common supertype or any common
209 subtype).
210
211 Anyway, to help clarify, here is a table containing some function
212 pairs and their LUB/GLB (for conciseness, in this table, I'm just
213 including the lifetimes here, not the rest of the types, and I'm
214 writing `fn<>` instead of `for<> fn`):
215
216 ```
217 Type 1 Type 2 LUB GLB
218 fn<'a>('a) fn('X) fn('X) fn<'a>('a)
219 fn('a) fn('X) -- fn<'a>('a)
220 fn<'a,'b>('a, 'b) fn<'x>('x, 'x) fn<'a>('a, 'a) fn<'a,'b>('a, 'b)
221 fn<'a,'b>('a, 'b, 'a) fn<'x,'y>('x, 'y, 'y) fn<'a>('a, 'a, 'a) fn<'a,'b,'c>('a,'b,'c)
222 ```
223
224 ### Conventions
225
226 I use lower-case letters (e.g., `&a`) for bound regions and upper-case
227 letters for free regions (`&A`). Region variables written with a
228 dollar-sign (e.g., `$a`). I will try to remember to enumerate the
229 bound-regions on the fn type as well (e.g., `for<'a> fn(&a)`).
230
231 ### High-level summary
232
233 Both the LUB and the GLB algorithms work in a similar fashion. They
234 begin by replacing all bound regions (on both sides) with fresh region
235 inference variables. Therefore, both functions are converted to types
236 that contain only free regions. We can then compute the LUB/GLB in a
237 straightforward way, as described in `combine.rs`. This results in an
238 interim type T. The algorithms then examine the regions that appear
239 in T and try to, in some cases, replace them with bound regions to
240 yield the final result.
241
242 To decide whether to replace a region `R` that appears in `T` with
243 a bound region, the algorithms make use of two bits of
244 information. First is a set `V` that contains all region
245 variables created as part of the LUB/GLB computation (roughly; see
246 `region_vars_confined_to_snapshot()` for full details). `V` will
247 contain the region variables created to replace the bound regions
248 in the input types, but it also contains 'intermediate' variables
249 created to represent the LUB/GLB of individual regions.
250 Basically, when asked to compute the LUB/GLB of a region variable
251 with another region, the inferencer cannot oblige immediately
252 since the values of that variables are not known. Therefore, it
253 creates a new variable that is related to the two regions. For
254 example, the LUB of two variables `$x` and `$y` is a fresh
255 variable `$z` that is constrained such that `$x <= $z` and `$y <=
256 $z`. So `V` will contain these intermediate variables as well.
257
258 The other important factor in deciding how to replace a region in T is
259 the function `Tainted($r)` which, for a region variable, identifies
260 all regions that the region variable is related to in some way
261 (`Tainted()` made an appearance in the subtype computation as well).
262
263 ### LUB
264
265 The LUB algorithm proceeds in three steps:
266
267 1. Replace all bound regions (on both sides) with fresh region
268 inference variables.
269 2. Compute the LUB "as normal", meaning compute the GLB of each
270 pair of argument types and the LUB of the return types and
271 so forth. Combine those to a new function type `F`.
272 3. Replace each region `R` that appears in `F` as follows:
273 - Let `V` be the set of variables created during the LUB
274 computational steps 1 and 2, as described in the previous section.
275 - If `R` is not in `V`, replace `R` with itself.
276 - If `Tainted(R)` contains a region that is not in `V`,
277 replace `R` with itself.
278 - Otherwise, select the earliest variable in `Tainted(R)` that originates
279 from the left-hand side and replace `R` with the bound region that
280 this variable was a replacement for.
281
282 So, let's work through the simplest example: `fn(&A)` and `for<'a> fn(&a)`.
283 In this case, `&a` will be replaced with `$a` and the interim LUB type
284 `fn($b)` will be computed, where `$b=GLB(&A,$a)`. Therefore, `V =
285 {$a, $b}` and `Tainted($b) = { $b, $a, &A }`. When we go to replace
286 `$b`, we find that since `&A \in Tainted($b)` is not a member of `V`,
287 we leave `$b` as is. When region inference happens, `$b` will be
288 resolved to `&A`, as we wanted.
289
290 Let's look at a more complex one: `fn(&a, &b)` and `fn(&x, &x)`. In
291 this case, we'll end up with a (pre-replacement) LUB type of `fn(&g,
292 &h)` and a graph that looks like:
293
294 ```
295 $a $b *--$x
296 \ \ / /
297 \ $h-* /
298 $g-----------*
299 ```
300
301 Here `$g` and `$h` are fresh variables that are created to represent
302 the LUB/GLB of things requiring inference. This means that `V` and
303 `Tainted` will look like:
304
305 ```
306 V = {$a, $b, $g, $h, $x}
307 Tainted($g) = Tainted($h) = { $a, $b, $h, $g, $x }
308 ```
309
310 Therefore we replace both `$g` and `$h` with `$a`, and end up
311 with the type `fn(&a, &a)`.
312
313 ### GLB
314
315 The procedure for computing the GLB is similar. The difference lies
316 in computing the replacements for the various variables. For each
317 region `R` that appears in the type `F`, we again compute `Tainted(R)`
318 and examine the results:
319
320 1. If `R` is not in `V`, it is not replaced.
321 2. Else, if `Tainted(R)` contains only variables in `V`, and it
322 contains exactly one variable from the LHS and one variable from
323 the RHS, then `R` can be mapped to the bound version of the
324 variable from the LHS.
325 3. Else, if `Tainted(R)` contains no variable from the LHS and no
326 variable from the RHS, then `R` can be mapped to itself.
327 4. Else, `R` is mapped to a fresh bound variable.
328
329 These rules are pretty complex. Let's look at some examples to see
330 how they play out.
331
332 Out first example was `fn(&a)` and `fn(&X)`. In this case, `&a` will
333 be replaced with `$a` and we will ultimately compute a
334 (pre-replacement) GLB type of `fn($g)` where `$g=LUB($a,&X)`.
335 Therefore, `V={$a,$g}` and `Tainted($g)={$g,$a,&X}. To find the
336 replacement for `$g` we consult the rules above:
337 - Rule (1) does not apply because `$g \in V`
338 - Rule (2) does not apply because `&X \in Tainted($g)`
339 - Rule (3) does not apply because `$a \in Tainted($g)`
340 - Hence, by rule (4), we replace `$g` with a fresh bound variable `&z`.
341 So our final result is `fn(&z)`, which is correct.
342
343 The next example is `fn(&A)` and `fn(&Z)`. In this case, we will again
344 have a (pre-replacement) GLB of `fn(&g)`, where `$g = LUB(&A,&Z)`.
345 Therefore, `V={$g}` and `Tainted($g) = {$g, &A, &Z}`. In this case,
346 by rule (3), `$g` is mapped to itself, and hence the result is
347 `fn($g)`. This result is correct (in this case, at least), but it is
348 indicative of a case that *can* lead us into concluding that there is
349 no GLB when in fact a GLB does exist. See the section "Questionable
350 Results" below for more details.
351
352 The next example is `fn(&a, &b)` and `fn(&c, &c)`. In this case, as
353 before, we'll end up with `F=fn($g, $h)` where `Tainted($g) =
354 Tainted($h) = {$g, $h, $a, $b, $c}`. Only rule (4) applies and hence
355 we'll select fresh bound variables `y` and `z` and wind up with
356 `fn(&y, &z)`.
357
358 For the last example, let's consider what may seem trivial, but is
359 not: `fn(&a, &a)` and `fn(&b, &b)`. In this case, we'll get `F=fn($g,
360 $h)` where `Tainted($g) = {$g, $a, $x}` and `Tainted($h) = {$h, $a,
361 $x}`. Both of these sets contain exactly one bound variable from each
362 side, so we'll map them both to `&a`, resulting in `fn(&a, &a)`, which
363 is the desired result.
364
365 ### Shortcomings and correctness
366
367 You may be wondering whether this algorithm is correct. The answer is
368 "sort of". There are definitely cases where they fail to compute a
369 result even though a correct result exists. I believe, though, that
370 if they succeed, then the result is valid, and I will attempt to
371 convince you. The basic argument is that the "pre-replacement" step
372 computes a set of constraints. The replacements, then, attempt to
373 satisfy those constraints, using bound identifiers where needed.
374
375 For now I will briefly go over the cases for LUB/GLB and identify
376 their intent:
377
378 - LUB:
379 - The region variables that are substituted in place of bound regions
380 are intended to collect constraints on those bound regions.
381 - If Tainted(R) contains only values in V, then this region is unconstrained
382 and can therefore be generalized, otherwise it cannot.
383 - GLB:
384 - The region variables that are substituted in place of bound regions
385 are intended to collect constraints on those bound regions.
386 - If Tainted(R) contains exactly one variable from each side, and
387 only variables in V, that indicates that those two bound regions
388 must be equated.
389 - Otherwise, if Tainted(R) references any variables from left or right
390 side, then it is trying to combine a bound region with a free one or
391 multiple bound regions, so we need to select fresh bound regions.
392
393 Sorry this is more of a shorthand to myself. I will try to write up something
394 more convincing in the future.
395
396 #### Where are the algorithms wrong?
397
398 - The pre-replacement computation can fail even though using a
399 bound-region would have succeeded.
400 - We will compute GLB(fn(fn($a)), fn(fn($b))) as fn($c) where $c is the
401 GLB of $a and $b. But if inference finds that $a and $b must be mapped
402 to regions without a GLB, then this is effectively a failure to compute
403 the GLB. However, the result `fn<$c>(fn($c))` is a valid GLB.