1 # Skolemization and functions
3 One of the trickiest and most subtle aspects of regions is dealing
4 with higher-ranked things which include bound region variables, such
5 as function types. I strongly suggest that if you want to understand
6 the situation, you read this paper (which is, admittedly, very long,
7 but you don't have to read the whole thing):
9 http://research.microsoft.com/en-us/um/people/simonpj/papers/higher-rank/
11 Although my explanation will never compete with SPJ's (for one thing,
12 his is approximately 100 pages), I will attempt to explain the basic
13 problem and also how we solve it. Note that the paper only discusses
14 subtyping, not the computation of LUB/GLB.
16 The problem we are addressing is that there is a kind of subtyping
17 between functions with bound region parameters. Consider, for
18 example, whether the following relation holds:
20 for<'a> fn(&'a isize) <: for<'b> fn(&'b isize)? (Yes, a => b)
22 The answer is that of course it does. These two types are basically
23 the same, except that in one we used the name `a` and one we used
26 In the examples that follow, it becomes very important to know whether
27 a lifetime is bound in a function type (that is, is a lifetime
28 parameter) or appears free (is defined in some outer scope).
29 Therefore, from now on I will always write the bindings explicitly,
30 using the Rust syntax `for<'a> fn(&'a isize)` to indicate that `a` is a
33 Now let's consider two more function types. Here, we assume that the
34 `'b` lifetime is defined somewhere outside and hence is not a lifetime
35 parameter bound by the function type (it "appears free"):
37 for<'a> fn(&'a isize) <: fn(&'b isize)? (Yes, a => b)
39 This subtyping relation does in fact hold. To see why, you have to
40 consider what subtyping means. One way to look at `T1 <: T2` is to
41 say that it means that it is always ok to treat an instance of `T1` as
42 if it had the type `T2`. So, with our functions, it is always ok to
43 treat a function that can take pointers with any lifetime as if it
44 were a function that can only take a pointer with the specific
45 lifetime `'b`. After all, `'b` is a lifetime, after all, and
46 the function can take values of any lifetime.
48 You can also look at subtyping as the *is a* relationship. This amounts
49 to the same thing: a function that accepts pointers with any lifetime
50 *is a* function that accepts pointers with some specific lifetime.
52 So, what if we reverse the order of the two function types, like this:
54 fn(&'b isize) <: for<'a> fn(&'a isize)? (No)
56 Does the subtyping relationship still hold? The answer of course is
57 no. In this case, the function accepts *only the lifetime `'b`*,
58 so it is not reasonable to treat it as if it were a function that
59 accepted any lifetime.
61 What about these two examples:
63 for<'a,'b> fn(&'a isize, &'b isize) <: for<'a> fn(&'a isize, &'a isize)? (Yes)
64 for<'a> fn(&'a isize, &'a isize) <: for<'a,'b> fn(&'a isize, &'b isize)? (No)
66 Here, it is true that functions which take two pointers with any two
67 lifetimes can be treated as if they only accepted two pointers with
68 the same lifetime, but not the reverse.
72 Here is the algorithm we use to perform the subtyping check:
74 1. Replace all bound regions in the subtype with new variables
75 2. Replace all bound regions in the supertype with skolemized
76 equivalents. A "skolemized" region is just a new fresh region
78 3. Check that the parameter and return types match as normal
79 4. Ensure that no skolemized regions 'leak' into region variables
80 visible from "the outside"
82 Let's walk through some examples and see how this algorithm plays out.
86 We'll start with the first example, which was:
88 1. for<'a> fn(&'a T) <: for<'b> fn(&'b T)? Yes: a -> b
90 After steps 1 and 2 of the algorithm we will have replaced the types
93 1. fn(&'A T) <: fn(&'x T)?
95 Here the upper case `&A` indicates a *region variable*, that is, a
96 region whose value is being inferred by the system. I also replaced
97 `&b` with `&x`---I'll use letters late in the alphabet (`x`, `y`, `z`)
98 to indicate skolemized region names. We can assume they don't appear
99 elsewhere. Note that neither the sub- nor the supertype bind any
100 region names anymore (as indicated by the absence of `<` and `>`).
102 The next step is to check that the parameter types match. Because
103 parameters are contravariant, this means that we check whether:
107 Region pointers are contravariant so this implies that
111 must hold, where `<=` is the subregion relationship. Processing
112 *this* constrain simply adds a constraint into our graph that `&A <=
113 &x` and is considered successful (it can, for example, be satisfied by
114 choosing the value `&x` for `&A`).
116 So far we have encountered no error, so the subtype check succeeds.
118 #### The third example
120 Now let's look first at the third example, which was:
122 3. fn(&'a T) <: for<'b> fn(&'b T)? No!
124 After steps 1 and 2 of the algorithm we will have replaced the types
127 3. fn(&'a T) <: fn(&'x T)?
129 This looks pretty much the same as before, except that on the LHS
130 `'a` was not bound, and hence was left as-is and not replaced with
131 a variable. The next step is again to check that the parameter types
132 match. This will ultimately require (as before) that `'a` <= `&x`
133 must hold: but this does not hold. `self` and `x` are both distinct
134 free regions. So the subtype check fails.
136 #### Checking for skolemization leaks
138 You may be wondering about that mysterious last step in the algorithm.
139 So far it has not been relevant. The purpose of that last step is to
140 catch something like *this*:
142 for<'a> fn() -> fn(&'a T) <: fn() -> for<'b> fn(&'b T)? No.
144 Here the function types are the same but for where the binding occurs.
145 The subtype returns a function that expects a value in precisely one
146 region. The supertype returns a function that expects a value in any
147 region. If we allow an instance of the subtype to be used where the
148 supertype is expected, then, someone could call the fn and think that
149 the return value has type `fn<b>(&'b T)` when it really has type
150 `fn(&'a T)` (this is case #3, above). Bad.
152 So let's step through what happens when we perform this subtype check.
153 We first replace the bound regions in the subtype (the supertype has
154 no bound regions). This gives us:
156 fn() -> fn(&'A T) <: fn() -> for<'b> fn(&'b T)?
158 Now we compare the return types, which are covariant, and hence we have:
160 fn(&'A T) <: for<'b> fn(&'b T)?
162 Here we skolemize the bound region in the supertype to yield:
164 fn(&'A T) <: fn(&'x T)?
166 And then proceed to compare the argument types:
171 Finally, this is where it gets interesting! This is where an error
172 *should* be reported. But in fact this will not happen. The reason why
173 is that `A` is a variable: we will infer that its value is the fresh
174 region `x` and think that everything is happy. In fact, this behavior
175 is *necessary*, it was key to the first example we walked through.
177 The difference between this example and the first one is that the variable
178 `A` already existed at the point where the skolemization occurred. In
179 the first example, you had two functions:
181 for<'a> fn(&'a T) <: for<'b> fn(&'b T)
183 and hence `&A` and `&x` were created "together". In general, the
184 intention of the skolemized names is that they are supposed to be
185 fresh names that could never be equal to anything from the outside.
186 But when inference comes into play, we might not be respecting this
189 So the way we solve this is to add a fourth step that examines the
190 constraints that refer to skolemized names. Basically, consider a
191 non-directed version of the constraint graph. Let `Tainted(x)` be the
192 set of all things reachable from a skolemized variable `x`.
193 `Tainted(x)` should not contain any regions that existed before the
194 step at which the skolemization was performed. So this case here
195 would fail because `&x` was created alone, but is relatable to `&A`.
197 ## Computing the LUB and GLB
199 The paper I pointed you at is written for Haskell. It does not
200 therefore considering subtyping and in particular does not consider
201 LUB or GLB computation. We have to consider this. Here is the
202 algorithm I implemented.
204 First though, let's discuss what we are trying to compute in more
205 detail. The LUB is basically the "common supertype" and the GLB is
206 "common subtype"; one catch is that the LUB should be the
207 *most-specific* common supertype and the GLB should be *most general*
208 common subtype (as opposed to any common supertype or any common
211 Anyway, to help clarify, here is a table containing some function
212 pairs and their LUB/GLB (for conciseness, in this table, I'm just
213 including the lifetimes here, not the rest of the types, and I'm
214 writing `fn<>` instead of `for<> fn`):
217 Type 1 Type 2 LUB GLB
218 fn<'a>('a) fn('X) fn('X) fn<'a>('a)
219 fn('a) fn('X) -- fn<'a>('a)
220 fn<'a,'b>('a, 'b) fn<'x>('x, 'x) fn<'a>('a, 'a) fn<'a,'b>('a, 'b)
221 fn<'a,'b>('a, 'b, 'a) fn<'x,'y>('x, 'y, 'y) fn<'a>('a, 'a, 'a) fn<'a,'b,'c>('a,'b,'c)
226 I use lower-case letters (e.g., `&a`) for bound regions and upper-case
227 letters for free regions (`&A`). Region variables written with a
228 dollar-sign (e.g., `$a`). I will try to remember to enumerate the
229 bound-regions on the fn type as well (e.g., `for<'a> fn(&a)`).
231 ### High-level summary
233 Both the LUB and the GLB algorithms work in a similar fashion. They
234 begin by replacing all bound regions (on both sides) with fresh region
235 inference variables. Therefore, both functions are converted to types
236 that contain only free regions. We can then compute the LUB/GLB in a
237 straightforward way, as described in `combine.rs`. This results in an
238 interim type T. The algorithms then examine the regions that appear
239 in T and try to, in some cases, replace them with bound regions to
240 yield the final result.
242 To decide whether to replace a region `R` that appears in `T` with
243 a bound region, the algorithms make use of two bits of
244 information. First is a set `V` that contains all region
245 variables created as part of the LUB/GLB computation (roughly; see
246 `region_vars_confined_to_snapshot()` for full details). `V` will
247 contain the region variables created to replace the bound regions
248 in the input types, but it also contains 'intermediate' variables
249 created to represent the LUB/GLB of individual regions.
250 Basically, when asked to compute the LUB/GLB of a region variable
251 with another region, the inferencer cannot oblige immediately
252 since the values of that variables are not known. Therefore, it
253 creates a new variable that is related to the two regions. For
254 example, the LUB of two variables `$x` and `$y` is a fresh
255 variable `$z` that is constrained such that `$x <= $z` and `$y <=
256 $z`. So `V` will contain these intermediate variables as well.
258 The other important factor in deciding how to replace a region in T is
259 the function `Tainted($r)` which, for a region variable, identifies
260 all regions that the region variable is related to in some way
261 (`Tainted()` made an appearance in the subtype computation as well).
265 The LUB algorithm proceeds in three steps:
267 1. Replace all bound regions (on both sides) with fresh region
269 2. Compute the LUB "as normal", meaning compute the GLB of each
270 pair of argument types and the LUB of the return types and
271 so forth. Combine those to a new function type `F`.
272 3. Replace each region `R` that appears in `F` as follows:
273 - Let `V` be the set of variables created during the LUB
274 computational steps 1 and 2, as described in the previous section.
275 - If `R` is not in `V`, replace `R` with itself.
276 - If `Tainted(R)` contains a region that is not in `V`,
277 replace `R` with itself.
278 - Otherwise, select the earliest variable in `Tainted(R)` that originates
279 from the left-hand side and replace `R` with the bound region that
280 this variable was a replacement for.
282 So, let's work through the simplest example: `fn(&A)` and `for<'a> fn(&a)`.
283 In this case, `&a` will be replaced with `$a` and the interim LUB type
284 `fn($b)` will be computed, where `$b=GLB(&A,$a)`. Therefore, `V =
285 {$a, $b}` and `Tainted($b) = { $b, $a, &A }`. When we go to replace
286 `$b`, we find that since `&A \in Tainted($b)` is not a member of `V`,
287 we leave `$b` as is. When region inference happens, `$b` will be
288 resolved to `&A`, as we wanted.
290 Let's look at a more complex one: `fn(&a, &b)` and `fn(&x, &x)`. In
291 this case, we'll end up with a (pre-replacement) LUB type of `fn(&g,
292 &h)` and a graph that looks like:
301 Here `$g` and `$h` are fresh variables that are created to represent
302 the LUB/GLB of things requiring inference. This means that `V` and
303 `Tainted` will look like:
306 V = {$a, $b, $g, $h, $x}
307 Tainted($g) = Tainted($h) = { $a, $b, $h, $g, $x }
310 Therefore we replace both `$g` and `$h` with `$a`, and end up
311 with the type `fn(&a, &a)`.
315 The procedure for computing the GLB is similar. The difference lies
316 in computing the replacements for the various variables. For each
317 region `R` that appears in the type `F`, we again compute `Tainted(R)`
318 and examine the results:
320 1. If `R` is not in `V`, it is not replaced.
321 2. Else, if `Tainted(R)` contains only variables in `V`, and it
322 contains exactly one variable from the LHS and one variable from
323 the RHS, then `R` can be mapped to the bound version of the
324 variable from the LHS.
325 3. Else, if `Tainted(R)` contains no variable from the LHS and no
326 variable from the RHS, then `R` can be mapped to itself.
327 4. Else, `R` is mapped to a fresh bound variable.
329 These rules are pretty complex. Let's look at some examples to see
332 Out first example was `fn(&a)` and `fn(&X)`. In this case, `&a` will
333 be replaced with `$a` and we will ultimately compute a
334 (pre-replacement) GLB type of `fn($g)` where `$g=LUB($a,&X)`.
335 Therefore, `V={$a,$g}` and `Tainted($g)={$g,$a,&X}. To find the
336 replacement for `$g` we consult the rules above:
337 - Rule (1) does not apply because `$g \in V`
338 - Rule (2) does not apply because `&X \in Tainted($g)`
339 - Rule (3) does not apply because `$a \in Tainted($g)`
340 - Hence, by rule (4), we replace `$g` with a fresh bound variable `&z`.
341 So our final result is `fn(&z)`, which is correct.
343 The next example is `fn(&A)` and `fn(&Z)`. In this case, we will again
344 have a (pre-replacement) GLB of `fn(&g)`, where `$g = LUB(&A,&Z)`.
345 Therefore, `V={$g}` and `Tainted($g) = {$g, &A, &Z}`. In this case,
346 by rule (3), `$g` is mapped to itself, and hence the result is
347 `fn($g)`. This result is correct (in this case, at least), but it is
348 indicative of a case that *can* lead us into concluding that there is
349 no GLB when in fact a GLB does exist. See the section "Questionable
350 Results" below for more details.
352 The next example is `fn(&a, &b)` and `fn(&c, &c)`. In this case, as
353 before, we'll end up with `F=fn($g, $h)` where `Tainted($g) =
354 Tainted($h) = {$g, $h, $a, $b, $c}`. Only rule (4) applies and hence
355 we'll select fresh bound variables `y` and `z` and wind up with
358 For the last example, let's consider what may seem trivial, but is
359 not: `fn(&a, &a)` and `fn(&b, &b)`. In this case, we'll get `F=fn($g,
360 $h)` where `Tainted($g) = {$g, $a, $x}` and `Tainted($h) = {$h, $a,
361 $x}`. Both of these sets contain exactly one bound variable from each
362 side, so we'll map them both to `&a`, resulting in `fn(&a, &a)`, which
363 is the desired result.
365 ### Shortcomings and correctness
367 You may be wondering whether this algorithm is correct. The answer is
368 "sort of". There are definitely cases where they fail to compute a
369 result even though a correct result exists. I believe, though, that
370 if they succeed, then the result is valid, and I will attempt to
371 convince you. The basic argument is that the "pre-replacement" step
372 computes a set of constraints. The replacements, then, attempt to
373 satisfy those constraints, using bound identifiers where needed.
375 For now I will briefly go over the cases for LUB/GLB and identify
379 - The region variables that are substituted in place of bound regions
380 are intended to collect constraints on those bound regions.
381 - If Tainted(R) contains only values in V, then this region is unconstrained
382 and can therefore be generalized, otherwise it cannot.
384 - The region variables that are substituted in place of bound regions
385 are intended to collect constraints on those bound regions.
386 - If Tainted(R) contains exactly one variable from each side, and
387 only variables in V, that indicates that those two bound regions
389 - Otherwise, if Tainted(R) references any variables from left or right
390 side, then it is trying to combine a bound region with a free one or
391 multiple bound regions, so we need to select fresh bound regions.
393 Sorry this is more of a shorthand to myself. I will try to write up something
394 more convincing in the future.
396 #### Where are the algorithms wrong?
398 - The pre-replacement computation can fail even though using a
399 bound-region would have succeeded.
400 - We will compute GLB(fn(fn($a)), fn(fn($b))) as fn($c) where $c is the
401 GLB of $a and $b. But if inference finds that $a and $b must be mapped
402 to regions without a GLB, then this is effectively a failure to compute
403 the GLB. However, the result `fn<$c>(fn($c))` is a valid GLB.