]>
Commit | Line | Data |
---|---|---|
532ac7d7 XL |
1 | # Appendix: Macro Follow-Set Ambiguity Formal Specification |
2 | ||
3 | This page documents the formal specification of the follow rules for [Macros | |
4 | By Example]. They were originally specified in [RFC 550], from which the bulk | |
5 | of this text is copied, and expanded upon in subsequent RFCs. | |
6 | ||
7 | ## Definitions & Conventions | |
8 | ||
9 | - `macro`: anything invokable as `foo!(...)` in source code. | |
10 | - `MBE`: macro-by-example, a macro defined by `macro_rules`. | |
11 | - `matcher`: the left-hand-side of a rule in a `macro_rules` invocation, or a | |
12 | subportion thereof. | |
13 | - `macro parser`: the bit of code in the Rust parser that will parse the | |
14 | input using a grammar derived from all of the matchers. | |
15 | - `fragment`: The class of Rust syntax that a given matcher will accept (or | |
16 | "match"). | |
17 | - `repetition` : a fragment that follows a regular repeating pattern | |
18 | - `NT`: non-terminal, the various "meta-variables" or repetition matchers | |
19 | that can appear in a matcher, specified in MBE syntax with a leading `$` | |
20 | character. | |
21 | - `simple NT`: a "meta-variable" non-terminal (further discussion below). | |
22 | - `complex NT`: a repetition matching non-terminal, specified via repetition | |
23 | operators (`\*`, `+`, `?`). | |
24 | - `token`: an atomic element of a matcher; i.e. identifiers, operators, | |
25 | open/close delimiters, *and* simple NT's. | |
26 | - `token tree`: a tree structure formed from tokens (the leaves), complex | |
27 | NT's, and finite sequences of token trees. | |
28 | - `delimiter token`: a token that is meant to divide the end of one fragment | |
29 | and the start of the next fragment. | |
30 | - `separator token`: an optional delimiter token in an complex NT that | |
31 | separates each pair of elements in the matched repetition. | |
32 | - `separated complex NT`: a complex NT that has its own separator token. | |
33 | - `delimited sequence`: a sequence of token trees with appropriate open- and | |
34 | close-delimiters at the start and end of the sequence. | |
35 | - `empty fragment`: The class of invisible Rust syntax that separates tokens, | |
36 | i.e. whitespace, or (in some lexical contexts), the empty token sequence. | |
37 | - `fragment specifier`: The identifier in a simple NT that specifies which | |
38 | fragment the NT accepts. | |
39 | - `language`: a context-free language. | |
40 | ||
41 | Example: | |
42 | ||
43 | ```rust,compile_fail | |
44 | macro_rules! i_am_an_mbe { | |
45 | (start $foo:expr $($i:ident),* end) => ($foo) | |
46 | } | |
47 | ``` | |
48 | ||
49 | `(start $foo:expr $($i:ident),\* end)` is a matcher. The whole matcher is a | |
50 | delimited sequence (with open- and close-delimiters `(` and `)`), and `$foo` | |
51 | and `$i` are simple NT's with `expr` and `ident` as their respective fragment | |
52 | specifiers. | |
53 | ||
54 | `$(i:ident),\*` is *also* an NT; it is a complex NT that matches a | |
55 | comma-separated repetition of identifiers. The `,` is the separator token for | |
56 | the complex NT; it occurs in between each pair of elements (if any) of the | |
57 | matched fragment. | |
58 | ||
59 | Another example of a complex NT is `$(hi $e:expr ;)+`, which matches any | |
60 | fragment of the form `hi <expr>; hi <expr>; ...` where `hi <expr>;` occurs at | |
61 | least once. Note that this complex NT does not have a dedicated separator | |
62 | token. | |
63 | ||
64 | (Note that Rust's parser ensures that delimited sequences always occur with | |
65 | proper nesting of token tree structure and correct matching of open- and | |
66 | close-delimiters.) | |
67 | ||
68 | We will tend to use the variable "M" to stand for a matcher, variables "t" and | |
69 | "u" for arbitrary individual tokens, and the variables "tt" and "uu" for | |
70 | arbitrary token trees. (The use of "tt" does present potential ambiguity with | |
71 | its additional role as a fragment specifier; but it will be clear from context | |
72 | which interpretation is meant.) | |
73 | ||
74 | "SEP" will range over separator tokens, "OP" over the repetition operators | |
75 | `\*`, `+`, and `?`, "OPEN"/"CLOSE" over matching token pairs surrounding a | |
76 | delimited sequence (e.g. `[` and `]`). | |
77 | ||
78 | Greek letters "α" "β" "γ" "δ" stand for potentially empty token-tree sequences. | |
79 | (However, the Greek letter "ε" (epsilon) has a special role in the presentation | |
80 | and does not stand for a token-tree sequence.) | |
81 | ||
82 | * This Greek letter convention is usually just employed when the presence of | |
83 | a sequence is a technical detail; in particular, when we wish to *emphasize* | |
84 | that we are operating on a sequence of token-trees, we will use the notation | |
85 | "tt ..." for the sequence, not a Greek letter. | |
86 | ||
87 | Note that a matcher is merely a token tree. A "simple NT", as mentioned above, | |
88 | is an meta-variable NT; thus it is a non-repetition. For example, `$foo:ty` is | |
89 | a simple NT but `$($foo:ty)+` is a complex NT. | |
90 | ||
91 | Note also that in the context of this formalism, the term "token" generally | |
92 | *includes* simple NTs. | |
93 | ||
94 | Finally, it is useful for the reader to keep in mind that according to the | |
95 | definitions of this formalism, no simple NT matches the empty fragment, and | |
96 | likewise no token matches the empty fragment of Rust syntax. (Thus, the *only* | |
97 | NT that can match the empty fragment is a complex NT.) This is not actually | |
98 | true, because the `vis` matcher can match an empty fragment. Thus, for the | |
99 | purposes of the formalism, we will treat `$v:vis` as actually being | |
100 | `$($v:vis)?`, with a requirement that the matcher match an empty fragment. | |
101 | ||
102 | ### The Matcher Invariants | |
103 | ||
104 | To be valid, a matcher must meet the following three invariants. The definitions | |
105 | of FIRST and FOLLOW are described later. | |
106 | ||
107 | 1. For any two successive token tree sequences in a matcher `M` (i.e. `M = ... | |
108 | tt uu ...`) with `uu ...` nonempty, we must have FOLLOW(`... tt`) ∪ {ε} ⊇ | |
109 | FIRST(`uu ...`). | |
110 | 1. For any separated complex NT in a matcher, `M = ... $(tt ...) SEP OP ...`, | |
111 | we must have `SEP` ∈ FOLLOW(`tt ...`). | |
112 | 1. For an unseparated complex NT in a matcher, `M = ... $(tt ...) OP ...`, if | |
113 | OP = `\*` or `+`, we must have FOLLOW(`tt ...`) ⊇ FIRST(`tt ...`). | |
114 | ||
115 | The first invariant says that whatever actual token that comes after a matcher, | |
116 | if any, must be somewhere in the predetermined follow set. This ensures that a | |
117 | legal macro definition will continue to assign the same determination as to | |
118 | where `... tt` ends and `uu ...` begins, even as new syntactic forms are added | |
119 | to the language. | |
120 | ||
121 | The second invariant says that a separated complex NT must use a separator token | |
122 | that is part of the predetermined follow set for the internal contents of the | |
123 | NT. This ensures that a legal macro definition will continue to parse an input | |
124 | fragment into the same delimited sequence of `tt ...`'s, even as new syntactic | |
125 | forms are added to the language. | |
126 | ||
127 | The third invariant says that when we have a complex NT that can match two or | |
128 | more copies of the same thing with no separation in between, it must be | |
129 | permissible for them to be placed next to each other as per the first invariant. | |
130 | This invariant also requires they be nonempty, which eliminates a possible | |
131 | ambiguity. | |
132 | ||
133 | **NOTE: The third invariant is currently unenforced due to historical oversight | |
134 | and significant reliance on the behaviour. It is currently undecided what to do | |
135 | about this going forward. Macros that do not respect the behaviour may become | |
136 | invalid in a future edition of Rust. See the [tracking issue].** | |
137 | ||
138 | ### FIRST and FOLLOW, informally | |
139 | ||
140 | A given matcher M maps to three sets: FIRST(M), LAST(M) and FOLLOW(M). | |
141 | ||
142 | Each of the three sets is made up of tokens. FIRST(M) and LAST(M) may also | |
143 | contain a distinguished non-token element ε ("epsilon"), which indicates that M | |
144 | can match the empty fragment. (But FOLLOW(M) is always just a set of tokens.) | |
145 | ||
146 | Informally: | |
147 | ||
148 | * FIRST(M): collects the tokens potentially used first when matching a | |
149 | fragment to M. | |
150 | ||
151 | * LAST(M): collects the tokens potentially used last when matching a fragment | |
152 | to M. | |
153 | ||
154 | * FOLLOW(M): the set of tokens allowed to follow immediately after some | |
155 | fragment matched by M. | |
156 | ||
157 | In other words: t ∈ FOLLOW(M) if and only if there exists (potentially | |
158 | empty) token sequences α, β, γ, δ where: | |
159 | ||
160 | * M matches β, | |
161 | ||
162 | * t matches γ, and | |
163 | ||
164 | * The concatenation α β γ δ is a parseable Rust program. | |
165 | ||
166 | We use the shorthand ANYTOKEN to denote the set of all tokens (including simple | |
167 | NTs). For example, if any token is legal after a matcher M, then FOLLOW(M) = | |
168 | ANYTOKEN. | |
169 | ||
170 | (To review one's understanding of the above informal descriptions, the reader | |
171 | at this point may want to jump ahead to the [examples of | |
172 | FIRST/LAST](#examples-of-first-and-last) before reading their formal | |
173 | definitions.) | |
174 | ||
175 | ### FIRST, LAST | |
176 | ||
177 | Below are formal inductive definitions for FIRST and LAST. | |
178 | ||
179 | "A ∪ B" denotes set union, "A ∩ B" denotes set intersection, and "A \ B" | |
180 | denotes set difference (i.e. all elements of A that are not present in B). | |
181 | ||
182 | #### FIRST | |
183 | ||
184 | FIRST(M) is defined by case analysis on the sequence M and the structure of its | |
185 | first token-tree (if any): | |
186 | ||
187 | * if M is the empty sequence, then FIRST(M) = { ε }, | |
188 | ||
189 | * if M starts with a token t, then FIRST(M) = { t }, | |
190 | ||
191 | (Note: this covers the case where M starts with a delimited token-tree | |
192 | sequence, `M = OPEN tt ... CLOSE ...`, in which case `t = OPEN` and thus | |
193 | FIRST(M) = { `OPEN` }.) | |
194 | ||
195 | (Note: this critically relies on the property that no simple NT matches the | |
196 | empty fragment.) | |
197 | ||
198 | * Otherwise, M is a token-tree sequence starting with a complex NT: `M = $( tt | |
199 | ... ) OP α`, or `M = $( tt ... ) SEP OP α`, (where `α` is the (potentially | |
200 | empty) sequence of token trees for the rest of the matcher). | |
201 | ||
202 | * Let SEP\_SET(M) = { SEP } if SEP is present and ε ∈ FIRST(`tt ...`); | |
203 | otherwise SEP\_SET(M) = {}. | |
204 | ||
205 | * Let ALPHA\_SET(M) = FIRST(`α`) if OP = `\*` or `?` and ALPHA\_SET(M) = {} if | |
206 | OP = `+`. | |
207 | * FIRST(M) = (FIRST(`tt ...`) \\ {ε}) ∪ SEP\_SET(M) ∪ ALPHA\_SET(M). | |
208 | ||
209 | The definition for complex NTs deserves some justification. SEP\_SET(M) defines | |
210 | the possibility that the separator could be a valid first token for M, which | |
211 | happens when there is a separator defined and the repeated fragment could be | |
212 | empty. ALPHA\_SET(M) defines the possibility that the complex NT could be empty, | |
213 | meaning that M's valid first tokens are those of the following token-tree | |
214 | sequences `α`. This occurs when either `\*` or `?` is used, in which case there | |
215 | could be zero repetitions. In theory, this could also occur if `+` was used with | |
216 | a potentially-empty repeating fragment, but this is forbidden by the third | |
217 | invariant. | |
218 | ||
219 | From there, clearly FIRST(M) can include any token from SEP\_SET(M) or | |
220 | ALPHA\_SET(M), and if the complex NT match is nonempty, then any token starting | |
221 | FIRST(`tt ...`) could work too. The last piece to consider is ε. SEP\_SET(M) and | |
222 | FIRST(`tt ...`) \ {ε} cannot contain ε, but ALPHA\_SET(M) could. Hence, this | |
223 | definition allows M to accept ε if and only if ε ∈ ALPHA\_SET(M) does. This is | |
224 | correct because for M to accept ε in the complex NT case, both the complex NT | |
225 | and α must accept it. If OP = `+`, meaning that the complex NT cannot be empty, | |
226 | then by definition ε ∉ ALPHA\_SET(M). Otherwise, the complex NT can accept zero | |
227 | repetitions, and then ALPHA\_SET(M) = FOLLOW(`α`). So this definition is correct | |
228 | with respect to \varepsilon as well. | |
229 | ||
230 | #### LAST | |
231 | ||
232 | LAST(M), defined by case analysis on M itself (a sequence of token-trees): | |
233 | ||
234 | * if M is the empty sequence, then LAST(M) = { ε } | |
235 | ||
236 | * if M is a singleton token t, then LAST(M) = { t } | |
237 | ||
238 | * if M is the singleton complex NT repeating zero or more times, `M = $( tt | |
239 | ... ) *`, or `M = $( tt ... ) SEP *` | |
240 | ||
241 | * Let sep_set = { SEP } if SEP present; otherwise sep_set = {}. | |
242 | ||
243 | * if ε ∈ LAST(`tt ...`) then LAST(M) = LAST(`tt ...`) ∪ sep_set | |
244 | ||
245 | * otherwise, the sequence `tt ...` must be non-empty; LAST(M) = LAST(`tt | |
246 | ...`) ∪ {ε}. | |
247 | ||
248 | * if M is the singleton complex NT repeating one or more times, `M = $( tt ... | |
249 | ) +`, or `M = $( tt ... ) SEP +` | |
250 | ||
251 | * Let sep_set = { SEP } if SEP present; otherwise sep_set = {}. | |
252 | ||
253 | * if ε ∈ LAST(`tt ...`) then LAST(M) = LAST(`tt ...`) ∪ sep_set | |
254 | ||
255 | * otherwise, the sequence `tt ...` must be non-empty; LAST(M) = LAST(`tt | |
256 | ...`) | |
257 | ||
258 | * if M is the singleton complex NT repeating zero or one time, `M = $( tt ...) | |
259 | ?`, then LAST(M) = LAST(`tt ...`) ∪ {ε}. | |
260 | ||
261 | * if M is a delimited token-tree sequence `OPEN tt ... CLOSE`, then LAST(M) = | |
262 | { `CLOSE` }. | |
263 | ||
264 | * if M is a non-empty sequence of token-trees `tt uu ...`, | |
265 | ||
266 | * If ε ∈ LAST(`uu ...`), then LAST(M) = LAST(`tt`) ∪ (LAST(`uu ...`) \ { ε }). | |
267 | ||
268 | * Otherwise, the sequence `uu ...` must be non-empty; then LAST(M) = | |
269 | LAST(`uu ...`). | |
270 | ||
271 | ### Examples of FIRST and LAST | |
532ac7d7 XL |
272 | |
273 | Below are some examples of FIRST and LAST. | |
274 | (Note in particular how the special ε element is introduced and | |
275 | eliminated based on the interaction between the pieces of the input.) | |
276 | ||
277 | Our first example is presented in a tree structure to elaborate on how | |
278 | the analysis of the matcher composes. (Some of the simpler subtrees | |
279 | have been elided.) | |
280 | ||
281 | ```text | |
282 | INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g | |
283 | ~~~~~~~~ ~~~~~~~ ~ | |
284 | | | | | |
285 | FIRST: { $d:ident } { $e:expr } { h } | |
286 | ||
287 | ||
288 | INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ | |
289 | ~~~~~~~~~~~~~~~~~~ ~~~~~~~ ~~~ | |
290 | | | | | |
291 | FIRST: { $d:ident } { h, ε } { f } | |
292 | ||
293 | INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g | |
294 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~ ~~~~~~~~~ ~ | |
295 | | | | | | |
296 | FIRST: { $d:ident, ε } { h, ε, ; } { f } { g } | |
297 | ||
298 | ||
299 | INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g | |
300 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
301 | | | |
302 | FIRST: { $d:ident, h, ;, f } | |
303 | ``` | |
304 | ||
305 | Thus: | |
306 | ||
307 | * FIRST(`$($d:ident $e:expr );* $( $(h)* );* $( f ;)+ g`) = { `$d:ident`, `h`, `;`, `f` } | |
308 | ||
309 | Note however that: | |
310 | ||
311 | * FIRST(`$($d:ident $e:expr );* $( $(h)* );* $($( f ;)+ g)*`) = { `$d:ident`, `h`, `;`, `f`, ε } | |
312 | ||
313 | Here are similar examples but now for LAST. | |
314 | ||
315 | * LAST(`$d:ident $e:expr`) = { `$e:expr` } | |
316 | * LAST(`$( $d:ident $e:expr );*`) = { `$e:expr`, ε } | |
317 | * LAST(`$( $d:ident $e:expr );* $(h)*`) = { `$e:expr`, ε, `h` } | |
318 | * LAST(`$( $d:ident $e:expr );* $(h)* $( f ;)+`) = { `;` } | |
319 | * LAST(`$( $d:ident $e:expr );* $(h)* $( f ;)+ g`) = { `g` } | |
320 | ||
321 | ### FOLLOW(M) | |
322 | ||
323 | Finally, the definition for FOLLOW(M) is built up as follows. pat, expr, etc. | |
324 | represent simple nonterminals with the given fragment specifier. | |
325 | ||
326 | * FOLLOW(pat) = {`=>`, `,`, `=`, `|`, `if`, `in`}`. | |
327 | ||
328 | * FOLLOW(expr) = FOLLOW(stmt) = {`=>`, `,`, `;`}`. | |
329 | ||
330 | * FOLLOW(ty) = FOLLOW(path) = {`{`, `[`, `,`, `=>`, `:`, `=`, `>`, `>>`, `;`, | |
331 | `|`, `as`, `where`, block nonterminals}. | |
332 | ||
333 | * FOLLOW(vis) = {`,`l any keyword or identifier except a non-raw `priv`; any | |
334 | token that can begin a type; ident, ty, and path nonterminals}. | |
335 | ||
336 | * FOLLOW(t) = ANYTOKEN for any other simple token, including block, ident, | |
337 | tt, item, lifetime, literal and meta simple nonterminals, and all terminals. | |
338 | ||
339 | * FOLLOW(M), for any other M, is defined as the intersection, as t ranges over | |
340 | (LAST(M) \ {ε}), of FOLLOW(t). | |
341 | ||
342 | The tokens that can begin a type are, as of this writing, {`(`, `[`, `!`, `\*`, | |
343 | `&`, `&&`, `?`, lifetimes, `>`, `>>`, `::`, any non-keyword identifier, `super`, | |
344 | `self`, `Self`, `extern`, `crate`, `$crate`, `_`, `for`, `impl`, `fn`, `unsafe`, | |
345 | `typeof`, `dyn`}, although this list may not be complete because people won't | |
346 | always remember to update the appendix when new ones are added. | |
347 | ||
348 | Examples of FOLLOW for complex M: | |
349 | ||
350 | * FOLLOW(`$( $d:ident $e:expr )\*`) = FOLLOW(`$e:expr`) | |
351 | * FOLLOW(`$( $d:ident $e:expr )\* $(;)\*`) = FOLLOW(`$e:expr`) ∩ ANYTOKEN = FOLLOW(`$e:expr`) | |
352 | * FOLLOW(`$( $d:ident $e:expr )\* $(;)\* $( f |)+`) = ANYTOKEN | |
353 | ||
354 | ### Examples of valid and invalid matchers | |
355 | ||
356 | With the above specification in hand, we can present arguments for | |
357 | why particular matchers are legal and others are not. | |
358 | ||
359 | * `($ty:ty < foo ,)` : illegal, because FIRST(`< foo ,`) = { `<` } ⊈ FOLLOW(`ty`) | |
360 | ||
361 | * `($ty:ty , foo <)` : legal, because FIRST(`, foo <`) = { `,` } is ⊆ FOLLOW(`ty`). | |
362 | ||
363 | * `($pa:pat $pb:pat $ty:ty ,)` : illegal, because FIRST(`$pb:pat $ty:ty ,`) = { `$pb:pat` } ⊈ FOLLOW(`pat`), and also FIRST(`$ty:ty ,`) = { `$ty:ty` } ⊈ FOLLOW(`pat`). | |
364 | ||
365 | * `( $($a:tt $b:tt)* ; )` : legal, because FIRST(`$b:tt`) = { `$b:tt` } is ⊆ FOLLOW(`tt`) = ANYTOKEN, as is FIRST(`;`) = { `;` }. | |
366 | ||
367 | * `( $($t:tt),* , $(t:tt),* )` : legal, (though any attempt to actually use this macro will signal a local ambiguity error during expansion). | |
368 | ||
369 | * `($ty:ty $(; not sep)* -)` : illegal, because FIRST(`$(; not sep)* -`) = { `;`, `-` } is not in FOLLOW(`ty`). | |
370 | ||
371 | * `($($ty:ty)-+)` : illegal, because separator `-` is not in FOLLOW(`ty`). | |
372 | ||
373 | * `($($e:expr)*)` : illegal, because expr NTs are not in FOLLOW(expr NT). | |
374 | ||
416331ca | 375 | [Macros by Example]: macros-by-example.md |
74b04a01 | 376 | [RFC 550]: https://github.com/rust-lang/rfcs/blob/master/text/0550-macro-future-proofing.md |
532ac7d7 | 377 | [tracking issue]: https://github.com/rust-lang/rust/issues/56575 |