[rustc.git] / src / doc / reference / src / macro-ambiguity.md

# Appendix: Macro Follow-Set Ambiguity Formal Specification

This page documents the formal specification of the follow rules for [Macros
By Example]. They were originally specified in [RFC 550], from which the bulk
of this text is copied, and expanded upon in subsequent RFCs.

## Definitions & Conventions

  - `macro`: anything invokable as `foo!(...)` in source code.
  - `MBE`: macro-by-example, a macro defined by `macro_rules`.
  - `matcher`: the left-hand-side of a rule in a `macro_rules` invocation, or a
    subportion thereof.
  - `macro parser`: the bit of code in the Rust parser that will parse the
    input using a grammar derived from all of the matchers.
  - `fragment`: The class of Rust syntax that a given matcher will accept (or
    "match").
  - `repetition` : a fragment that follows a regular repeating pattern
  - `NT`: non-terminal, the various "meta-variables" or repetition matchers
	that can appear in a matcher, specified in MBE syntax with a leading `$`
	character.
  - `simple NT`: a "meta-variable" non-terminal (further discussion below).
  - `complex NT`: a repetition matching non-terminal, specified via repetition
    operators (`\*`, `+`, `?`).
  - `token`: an atomic element of a matcher; i.e. identifiers, operators,
    open/close delimiters, *and* simple NT's.
  - `token tree`: a tree structure formed from tokens (the leaves), complex
    NT's, and finite sequences of token trees.
  - `delimiter token`: a token that is meant to divide the end of one fragment
    and the start of the next fragment.
  - `separator token`: an optional delimiter token in an complex NT that
    separates each pair of elements in the matched repetition.
  - `separated complex NT`: a complex NT that has its own separator token.
  - `delimited sequence`: a sequence of token trees with appropriate open- and
    close-delimiters at the start and end of the sequence.
  - `empty fragment`: The class of invisible Rust syntax that separates tokens,
    i.e. whitespace, or (in some lexical contexts), the empty token sequence.
  - `fragment specifier`: The identifier in a simple NT that specifies which
    fragment the NT accepts.
  - `language`: a context-free language.

Example:

```rust,compile_fail
macro_rules! i_am_an_mbe {
    (start $foo:expr $($i:ident),* end) => ($foo)
}
```

`(start $foo:expr $($i:ident),\* end)` is a matcher. The whole matcher is a
delimited sequence (with open- and close-delimiters `(` and `)`), and `$foo`
and `$i` are simple NT's with `expr` and `ident` as their respective fragment
specifiers.

`$(i:ident),\*` is *also* an NT; it is a complex NT that matches a
comma-separated repetition of identifiers. The `,` is the separator token for
the complex NT; it occurs in between each pair of elements (if any) of the
matched fragment.

Another example of a complex NT is `$(hi $e:expr ;)+`, which matches any
fragment of the form `hi <expr>; hi <expr>; ...` where `hi <expr>;` occurs at
least once. Note that this complex NT does not have a dedicated separator
token.

(Note that Rust's parser ensures that delimited sequences always occur with
proper nesting of token tree structure and correct matching of open- and
close-delimiters.)

We will tend to use the variable "M" to stand for a matcher, variables "t" and
"u" for arbitrary individual tokens, and the variables "tt" and "uu" for
arbitrary token trees. (The use of "tt" does present potential ambiguity with
its additional role as a fragment specifier; but it will be clear from context
which interpretation is meant.)

"SEP" will range over separator tokens, "OP" over the repetition operators
`\*`, `+`, and `?`, "OPEN"/"CLOSE" over matching token pairs surrounding a
delimited sequence (e.g. `[` and `]`).

Greek letters "α" "β" "γ" "δ"  stand for potentially empty token-tree sequences.
(However, the Greek letter "ε" (epsilon) has a special role in the presentation
and does not stand for a token-tree sequence.)

  * This Greek letter convention is usually just employed when the presence of
	a sequence is a technical detail; in particular, when we wish to *emphasize*
	that we are operating on a sequence of token-trees, we will use the notation
	"tt ..." for the sequence, not a Greek letter.

Note that a matcher is merely a token tree. A "simple NT", as mentioned above,
is an meta-variable NT; thus it is a non-repetition. For example, `$foo:ty` is
a simple NT but `$($foo:ty)+` is a complex NT.

Note also that in the context of this formalism, the term "token" generally
*includes* simple NTs.

Finally, it is useful for the reader to keep in mind that according to the
definitions of this formalism, no simple NT matches the empty fragment, and
likewise no token matches the empty fragment of Rust syntax. (Thus, the *only*
NT that can match the empty fragment is a complex NT.) This is not actually
true, because the `vis` matcher can match an empty fragment. Thus, for the
purposes of the formalism, we will treat `$v:vis` as actually being
`$($v:vis)?`, with a requirement that the matcher match an empty fragment.

### The Matcher Invariants

To be valid, a matcher must meet the following three invariants. The definitions
of FIRST and FOLLOW are described later.

1.  For any two successive token tree sequences in a matcher `M` (i.e. `M = ...
    tt uu ...`) with `uu ...` nonempty, we must have FOLLOW(`... tt`) ∪ {ε} ⊇
    FIRST(`uu ...`).
1.  For any separated complex NT in a matcher, `M = ... $(tt ...) SEP OP ...`,
	we must have `SEP` ∈ FOLLOW(`tt ...`).
1.  For an unseparated complex NT in a matcher, `M = ... $(tt ...) OP ...`, if
    OP = `\*` or `+`, we must have FOLLOW(`tt ...`) ⊇ FIRST(`tt ...`).

The first invariant says that whatever actual token that comes after a matcher,
if any, must be somewhere in the predetermined follow set.  This ensures that a
legal macro definition will continue to assign the same determination as to
where `... tt` ends and `uu ...` begins, even as new syntactic forms are added
to the language.

The second invariant says that a separated complex NT must use a separator token
that is part of the predetermined follow set for the internal contents of the
NT. This ensures that a legal macro definition will continue to parse an input
fragment into the same delimited sequence of `tt ...`'s, even as new syntactic
forms are added to the language.

The third invariant says that when we have a complex NT that can match two or
more copies of the same thing with no separation in between, it must be
permissible for them to be placed next to each other as per the first invariant.
This invariant also requires they be nonempty, which eliminates a possible
ambiguity.

**NOTE: The third invariant is currently unenforced due to historical oversight
and significant reliance on the behaviour. It is currently undecided what to do
about this going forward. Macros that do not respect the behaviour may become
invalid in a future edition of Rust. See the [tracking issue].**

### FIRST and FOLLOW, informally

A given matcher M maps to three sets: FIRST(M), LAST(M) and FOLLOW(M).

Each of the three sets is made up of tokens. FIRST(M) and LAST(M) may also
contain a distinguished non-token element ε ("epsilon"), which indicates that M
can match the empty fragment. (But FOLLOW(M) is always just a set of tokens.)

Informally:

  * FIRST(M): collects the tokens potentially used first when matching a
    fragment to M.

  * LAST(M): collects the tokens potentially used last when matching a fragment
    to M.

  * FOLLOW(M): the set of tokens allowed to follow immediately after some
    fragment matched by M.

    In other words: t ∈ FOLLOW(M) if and only if there exists (potentially
    empty) token sequences α, β, γ, δ where:

      * M matches β,

      * t matches γ, and

      * The concatenation α β γ δ is a parseable Rust program.

We use the shorthand ANYTOKEN to denote the set of all tokens (including simple
NTs). For example, if any token is legal after a matcher M, then FOLLOW(M) =
ANYTOKEN.

(To review one's understanding of the above informal descriptions, the reader
at this point may want to jump ahead to the [examples of
FIRST/LAST](#examples-of-first-and-last) before reading their formal
definitions.)

### FIRST, LAST

Below are formal inductive definitions for FIRST and LAST.

"A ∪ B" denotes set union, "A ∩ B" denotes set intersection, and "A \ B"
denotes set difference (i.e. all elements of A that are not present in B).

#### FIRST

FIRST(M) is defined by case analysis on the sequence M and the structure of its
first token-tree (if any):

  * if M is the empty sequence, then FIRST(M) = { ε },

  * if M starts with a token t, then FIRST(M) = { t },

    (Note: this covers the case where M starts with a delimited token-tree
    sequence, `M = OPEN tt ... CLOSE ...`, in which case `t = OPEN` and thus
    FIRST(M) = { `OPEN` }.)

    (Note: this critically relies on the property that no simple NT matches the
    empty fragment.)

  * Otherwise, M is a token-tree sequence starting with a complex NT: `M = $( tt
    ... ) OP α`, or `M = $( tt ... ) SEP OP α`, (where `α` is the (potentially
    empty) sequence of token trees for the rest of the matcher).

      * Let SEP\_SET(M) = { SEP } if SEP is present and ε ∈ FIRST(`tt ...`);
        otherwise SEP\_SET(M) = {}.

  * Let ALPHA\_SET(M) = FIRST(`α`) if OP = `\*` or `?` and ALPHA\_SET(M) = {} if
    OP = `+`.
  * FIRST(M) = (FIRST(`tt ...`) \\ {ε}) ∪ SEP\_SET(M) ∪ ALPHA\_SET(M).

The definition for complex NTs deserves some justification. SEP\_SET(M) defines
the possibility that the separator could be a valid first token for M, which
happens when there is a separator defined and the repeated fragment could be
empty. ALPHA\_SET(M) defines the possibility that the complex NT could be empty,
meaning that M's valid first tokens are those of the following token-tree
sequences `α`. This occurs when either `\*` or `?` is used, in which case there
could be zero repetitions. In theory, this could also occur if `+` was used with
a potentially-empty repeating fragment, but this is forbidden by the third
invariant.

From there, clearly FIRST(M) can include any token from SEP\_SET(M) or
ALPHA\_SET(M), and if the complex NT match is nonempty, then any token starting
FIRST(`tt ...`) could work too. The last piece to consider is ε. SEP\_SET(M) and
FIRST(`tt ...`) \ {ε} cannot contain ε, but ALPHA\_SET(M) could. Hence, this
definition allows M to accept ε if and only if ε ∈ ALPHA\_SET(M) does. This is
correct because for M to accept ε in the complex NT case, both the complex NT
and α must accept it. If OP = `+`, meaning that the complex NT cannot be empty,
then by definition ε ∉ ALPHA\_SET(M). Otherwise, the complex NT can accept zero
repetitions, and then ALPHA\_SET(M) = FOLLOW(`α`). So this definition is correct
with respect to \varepsilon as well.

#### LAST

LAST(M), defined by case analysis on M itself (a sequence of token-trees):

  * if M is the empty sequence, then LAST(M) = { ε }

  * if M is a singleton token t, then LAST(M) = { t }

  * if M is the singleton complex NT repeating zero or more times, `M = $( tt
    ... ) *`, or `M = $( tt ... ) SEP *`

      * Let sep_set = { SEP } if SEP present; otherwise sep_set = {}.

      * if ε ∈ LAST(`tt ...`) then LAST(M) = LAST(`tt ...`) ∪ sep_set

      * otherwise, the sequence `tt ...` must be non-empty; LAST(M) = LAST(`tt
        ...`) ∪ {ε}.

  * if M is the singleton complex NT repeating one or more times, `M = $( tt ...
    ) +`, or `M = $( tt ... ) SEP +`

      * Let sep_set = { SEP } if SEP present; otherwise sep_set = {}.

      * if ε ∈ LAST(`tt ...`) then LAST(M) = LAST(`tt ...`) ∪ sep_set

      * otherwise, the sequence `tt ...` must be non-empty; LAST(M) = LAST(`tt
        ...`)

  * if M is the singleton complex NT repeating zero or one time, `M = $( tt ...)
    ?`, then LAST(M) = LAST(`tt ...`) ∪ {ε}.

  * if M is a delimited token-tree sequence `OPEN tt ... CLOSE`, then LAST(M) =
    { `CLOSE` }.

  * if M is a non-empty sequence of token-trees `tt uu ...`,

      * If ε ∈ LAST(`uu ...`), then LAST(M) = LAST(`tt`) ∪ (LAST(`uu ...`) \ { ε }).

      * Otherwise, the sequence `uu ...` must be non-empty; then LAST(M) =
        LAST(`uu ...`).

### Examples of FIRST and LAST

Below are some examples of FIRST and LAST.
(Note in particular how the special ε element is introduced and
eliminated based on the interaction between the pieces of the input.)

Our first example is presented in a tree structure to elaborate on how
the analysis of the matcher composes. (Some of the simpler subtrees
have been elided.)

```text
INPUT:  $(  $d:ident   $e:expr   );*    $( $( h )* );*    $( f ; )+   g
            ~~~~~~~~   ~~~~~~~                ~
                |         |                   |
FIRST:   { $d:ident }  { $e:expr }          { h }


INPUT:  $(  $d:ident   $e:expr   );*    $( $( h )* );*    $( f ; )+
            ~~~~~~~~~~~~~~~~~~             ~~~~~~~           ~~~
                        |                      |               |
FIRST:          { $d:ident }               { h, ε }         { f }

INPUT:  $(  $d:ident   $e:expr   );*    $( $( h )* );*    $( f ; )+   g
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~    ~~~~~~~~~~~~~~    ~~~~~~~~~   ~
                        |                       |              |       |
FIRST:        { $d:ident, ε }            {  h, ε, ;  }      { f }   { g }


INPUT:  $(  $d:ident   $e:expr   );*    $( $( h )* );*    $( f ; )+   g
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                        |
FIRST:                       { $d:ident, h, ;,  f }
```

Thus:

 * FIRST(`$($d:ident $e:expr );* $( $(h)* );* $( f ;)+ g`) = { `$d:ident`, `h`, `;`, `f` }

Note however that:

 * FIRST(`$($d:ident $e:expr );* $( $(h)* );* $($( f ;)+ g)*`) = { `$d:ident`, `h`, `;`, `f`, ε }

Here are similar examples but now for LAST.

 * LAST(`$d:ident $e:expr`) = { `$e:expr` }
 * LAST(`$( $d:ident $e:expr );*`) = { `$e:expr`, ε }
 * LAST(`$( $d:ident $e:expr );* $(h)*`) = { `$e:expr`, ε, `h` }
 * LAST(`$( $d:ident $e:expr );* $(h)* $( f ;)+`) = { `;` }
 * LAST(`$( $d:ident $e:expr );* $(h)* $( f ;)+ g`) = { `g` }

### FOLLOW(M)

Finally, the definition for FOLLOW(M) is built up as follows. pat, expr, etc.
represent simple nonterminals with the given fragment specifier.

  * FOLLOW(pat) = {`=>`, `,`, `=`, `|`, `if`, `in`}`.

  * FOLLOW(expr) = FOLLOW(stmt) =  {`=>`, `,`, `;`}`.

  * FOLLOW(ty) = FOLLOW(path) = {`{`, `[`, `,`, `=>`, `:`, `=`, `>`, `>>`, `;`,
    `|`, `as`, `where`, block nonterminals}.

  * FOLLOW(vis) = {`,`l any keyword or identifier except a non-raw `priv`; any
    token that can begin a type; ident, ty, and path nonterminals}.

  * FOLLOW(t) = ANYTOKEN for any other simple token, including block, ident,
    tt, item, lifetime, literal and meta simple nonterminals, and all terminals.

  * FOLLOW(M), for any other M, is defined as the intersection, as t ranges over
    (LAST(M) \ {ε}), of FOLLOW(t).

The tokens that can begin a type are, as of this writing, {`(`, `[`, `!`, `\*`,
`&`, `&&`, `?`, lifetimes, `>`, `>>`, `::`, any non-keyword identifier, `super`,
`self`, `Self`, `extern`, `crate`, `$crate`, `_`, `for`, `impl`, `fn`, `unsafe`,
`typeof`, `dyn`}, although this list may not be complete because people won't
always remember to update the appendix when new ones are added.

Examples of FOLLOW for complex M:

 * FOLLOW(`$( $d:ident $e:expr )\*`) = FOLLOW(`$e:expr`)
 * FOLLOW(`$( $d:ident $e:expr )\* $(;)\*`) = FOLLOW(`$e:expr`) ∩ ANYTOKEN = FOLLOW(`$e:expr`)
 * FOLLOW(`$( $d:ident $e:expr )\* $(;)\* $( f |)+`) = ANYTOKEN

### Examples of valid and invalid matchers

With the above specification in hand, we can present arguments for
why particular matchers are legal and others are not.

 * `($ty:ty < foo ,)` : illegal, because FIRST(`< foo ,`) = { `<` } ⊈ FOLLOW(`ty`)

 * `($ty:ty , foo <)` : legal, because FIRST(`, foo <`) = { `,` }  is ⊆ FOLLOW(`ty`).

 * `($pa:pat $pb:pat $ty:ty ,)` : illegal, because FIRST(`$pb:pat $ty:ty ,`) = { `$pb:pat` } ⊈ FOLLOW(`pat`), and also FIRST(`$ty:ty ,`) = { `$ty:ty` } ⊈ FOLLOW(`pat`).

 * `( $($a:tt $b:tt)* ; )` : legal, because FIRST(`$b:tt`) = { `$b:tt` } is ⊆ FOLLOW(`tt`) = ANYTOKEN, as is FIRST(`;`) = { `;` }.

 * `( $($t:tt),* , $(t:tt),* )` : legal,  (though any attempt to actually use this macro will signal a local ambiguity error during expansion).

 * `($ty:ty $(; not sep)* -)` : illegal, because FIRST(`$(; not sep)* -`) = { `;`, `-` } is not in FOLLOW(`ty`).

 * `($($ty:ty)-+)` : illegal, because separator `-` is not in FOLLOW(`ty`).

 * `($($e:expr)*)` : illegal, because expr NTs are not in FOLLOW(expr NT).

[Macros by Example]: macros-by-example.md
[RFC 550]: https://github.com/rust-lang/rfcs/blob/master/text/0550-macro-future-proofing.md
[tracking issue]: https://github.com/rust-lang/rust/issues/56575
Commit	Line	Data
532ac7d7 XL	1	# Appendix: Macro Follow-Set Ambiguity Formal Specification
	2
	3	This page documents the formal specification of the follow rules for [Macros
	4	By Example]. They were originally specified in [RFC 550], from which the bulk
	5	of this text is copied, and expanded upon in subsequent RFCs.
	6
	7	## Definitions & Conventions
	8
	9	- `macro`: anything invokable as `foo!(...)` in source code.
	10	- `MBE`: macro-by-example, a macro defined by `macro_rules`.
	11	- `matcher`: the left-hand-side of a rule in a `macro_rules` invocation, or a
	12	subportion thereof.
	13	- `macro parser`: the bit of code in the Rust parser that will parse the
	14	input using a grammar derived from all of the matchers.
	15	- `fragment`: The class of Rust syntax that a given matcher will accept (or
	16	"match").
	17	- `repetition` : a fragment that follows a regular repeating pattern
	18	- `NT`: non-terminal, the various "meta-variables" or repetition matchers
	19	that can appear in a matcher, specified in MBE syntax with a leading `$`
	20	character.
	21	- `simple NT`: a "meta-variable" non-terminal (further discussion below).
	22	- `complex NT`: a repetition matching non-terminal, specified via repetition
	23	operators (`\*`, `+`, `?`).
	24	- `token`: an atomic element of a matcher; i.e. identifiers, operators,
	25	open/close delimiters, and simple NT's.
	26	- `token tree`: a tree structure formed from tokens (the leaves), complex
	27	NT's, and finite sequences of token trees.
	28	- `delimiter token`: a token that is meant to divide the end of one fragment
	29	and the start of the next fragment.
	30	- `separator token`: an optional delimiter token in an complex NT that
	31	separates each pair of elements in the matched repetition.
	32	- `separated complex NT`: a complex NT that has its own separator token.
	33	- `delimited sequence`: a sequence of token trees with appropriate open- and
	34	close-delimiters at the start and end of the sequence.
	35	- `empty fragment`: The class of invisible Rust syntax that separates tokens,
	36	i.e. whitespace, or (in some lexical contexts), the empty token sequence.
	37	- `fragment specifier`: The identifier in a simple NT that specifies which
	38	fragment the NT accepts.
	39	- `language`: a context-free language.
	40
	41	Example:
	42
	43	```rust,compile_fail
	44	macro_rules! i_am_an_mbe {
	45	(start $foo:expr $($i:ident),* end) => ($foo)
	46	}
	47	```
	48
	49	`(start $foo:expr $($i:ident),\* end)` is a matcher. The whole matcher is a
	50	delimited sequence (with open- and close-delimiters `(` and `)`), and `$foo`
	51	and `$i` are simple NT's with `expr` and `ident` as their respective fragment
	52	specifiers.
	53
	54	`$(i:ident),\` is also* an NT; it is a complex NT that matches a
	55	comma-separated repetition of identifiers. The `,` is the separator token for
	56	the complex NT; it occurs in between each pair of elements (if any) of the
	57	matched fragment.
	58
	59	Another example of a complex NT is `$(hi $e:expr ;)+`, which matches any
	60	fragment of the form `hi <expr>; hi <expr>; ...` where `hi <expr>;` occurs at
	61	least once. Note that this complex NT does not have a dedicated separator
	62	token.
	63
	64	(Note that Rust's parser ensures that delimited sequences always occur with
65	proper nesting of token tree structure and correct matching of open- and
66	close-delimiters.)
67
68	We will tend to use the variable "M" to stand for a matcher, variables "t" and
69	"u" for arbitrary individual tokens, and the variables "tt" and "uu" for
70	arbitrary token trees. (The use of "tt" does present potential ambiguity with
71	its additional role as a fragment specifier; but it will be clear from context
72	which interpretation is meant.)
73
74	"SEP" will range over separator tokens, "OP" over the repetition operators
75	`\*`, `+`, and `?`, "OPEN"/"CLOSE" over matching token pairs surrounding a
76	delimited sequence (e.g. `[` and `]`).
77
78	Greek letters "α" "β" "γ" "δ" stand for potentially empty token-tree sequences.
79	(However, the Greek letter "ε" (epsilon) has a special role in the presentation
80	and does not stand for a token-tree sequence.)
81
82	* This Greek letter convention is usually just employed when the presence of
83	a sequence is a technical detail; in particular, when we wish to emphasize
84	that we are operating on a sequence of token-trees, we will use the notation
85	"tt ..." for the sequence, not a Greek letter.
86
87	Note that a matcher is merely a token tree. A "simple NT", as mentioned above,
88	is an meta-variable NT; thus it is a non-repetition. For example, `$foo:ty` is
89	a simple NT but `$($foo:ty)+` is a complex NT.
90
91	Note also that in the context of this formalism, the term "token" generally
92	includes simple NTs.
93
94	Finally, it is useful for the reader to keep in mind that according to the
95	definitions of this formalism, no simple NT matches the empty fragment, and
96	likewise no token matches the empty fragment of Rust syntax. (Thus, the only
97	NT that can match the empty fragment is a complex NT.) This is not actually
98	true, because the `vis` matcher can match an empty fragment. Thus, for the
99	purposes of the formalism, we will treat `$v:vis` as actually being
100	`$($v:vis)?`, with a requirement that the matcher match an empty fragment.
101
102	### The Matcher Invariants
103
104	To be valid, a matcher must meet the following three invariants. The definitions
105	of FIRST and FOLLOW are described later.
106
107	1. For any two successive token tree sequences in a matcher `M` (i.e. `M = ...
108	tt uu ...`) with `uu ...` nonempty, we must have FOLLOW(`... tt`) ∪ {ε} ⊇
109	FIRST(`uu ...`).
110	1. For any separated complex NT in a matcher, `M = ... $(tt ...) SEP OP ...`,
111	we must have `SEP` ∈ FOLLOW(`tt ...`).
112	1. For an unseparated complex NT in a matcher, `M = ... $(tt ...) OP ...`, if
113	OP = `\*` or `+`, we must have FOLLOW(`tt ...`) ⊇ FIRST(`tt ...`).
114
115	The first invariant says that whatever actual token that comes after a matcher,
116	if any, must be somewhere in the predetermined follow set. This ensures that a
117	legal macro definition will continue to assign the same determination as to
118	where `... tt` ends and `uu ...` begins, even as new syntactic forms are added
119	to the language.
120
121	The second invariant says that a separated complex NT must use a separator token
122	that is part of the predetermined follow set for the internal contents of the
123	NT. This ensures that a legal macro definition will continue to parse an input
124	fragment into the same delimited sequence of `tt ...`'s, even as new syntactic
125	forms are added to the language.
126
127	The third invariant says that when we have a complex NT that can match two or
128	more copies of the same thing with no separation in between, it must be
129	permissible for them to be placed next to each other as per the first invariant.
130	This invariant also requires they be nonempty, which eliminates a possible
131	ambiguity.
132
133	**NOTE: The third invariant is currently unenforced due to historical oversight
134	and significant reliance on the behaviour. It is currently undecided what to do
135	about this going forward. Macros that do not respect the behaviour may become
136	invalid in a future edition of Rust. See the [tracking issue].**
137
138	### FIRST and FOLLOW, informally
139
140	A given matcher M maps to three sets: FIRST(M), LAST(M) and FOLLOW(M).
141
142	Each of the three sets is made up of tokens. FIRST(M) and LAST(M) may also
143	contain a distinguished non-token element ε ("epsilon"), which indicates that M
144	can match the empty fragment. (But FOLLOW(M) is always just a set of tokens.)
145
146	Informally:
147
148	* FIRST(M): collects the tokens potentially used first when matching a
149	fragment to M.
150
151	* LAST(M): collects the tokens potentially used last when matching a fragment
152	to M.
153
154	* FOLLOW(M): the set of tokens allowed to follow immediately after some
155	fragment matched by M.
156
157	In other words: t ∈ FOLLOW(M) if and only if there exists (potentially
158	empty) token sequences α, β, γ, δ where:
159
160	* M matches β,
161
162	* t matches γ, and
163
164	* The concatenation α β γ δ is a parseable Rust program.
165
166	We use the shorthand ANYTOKEN to denote the set of all tokens (including simple
167	NTs). For example, if any token is legal after a matcher M, then FOLLOW(M) =
168	ANYTOKEN.
169
170	(To review one's understanding of the above informal descriptions, the reader
171	at this point may want to jump ahead to the [examples of
172	FIRST/LAST](#examples-of-first-and-last) before reading their formal
173	definitions.)
174
175	### FIRST, LAST
176
177	Below are formal inductive definitions for FIRST and LAST.
178
179	"A ∪ B" denotes set union, "A ∩ B" denotes set intersection, and "A \ B"
180	denotes set difference (i.e. all elements of A that are not present in B).
181
182	#### FIRST
183
184	FIRST(M) is defined by case analysis on the sequence M and the structure of its
185	first token-tree (if any):
186
187	* if M is the empty sequence, then FIRST(M) = { ε },
188
189	* if M starts with a token t, then FIRST(M) = { t },
190
191	(Note: this covers the case where M starts with a delimited token-tree
192	sequence, `M = OPEN tt ... CLOSE ...`, in which case `t = OPEN` and thus
193	FIRST(M) = { `OPEN` }.)
194
195	(Note: this critically relies on the property that no simple NT matches the
196	empty fragment.)
197
198	* Otherwise, M is a token-tree sequence starting with a complex NT: `M = $( tt
199	... ) OP α`, or `M = $( tt ... ) SEP OP α`, (where `α` is the (potentially
200	empty) sequence of token trees for the rest of the matcher).
201
202	* Let SEP\_SET(M) = { SEP } if SEP is present and ε ∈ FIRST(`tt ...`);
203	otherwise SEP\_SET(M) = {}.
204
205	* Let ALPHA\_SET(M) = FIRST(`α`) if OP = `\*` or `?` and ALPHA\_SET(M) = {} if
206	OP = `+`.
207	* FIRST(M) = (FIRST(`tt ...`) \\ {ε}) ∪ SEP\_SET(M) ∪ ALPHA\_SET(M).
208
209	The definition for complex NTs deserves some justification. SEP\_SET(M) defines
210	the possibility that the separator could be a valid first token for M, which
211	happens when there is a separator defined and the repeated fragment could be
212	empty. ALPHA\_SET(M) defines the possibility that the complex NT could be empty,
213	meaning that M's valid first tokens are those of the following token-tree
214	sequences `α`. This occurs when either `\*` or `?` is used, in which case there
215	could be zero repetitions. In theory, this could also occur if `+` was used with
216	a potentially-empty repeating fragment, but this is forbidden by the third
217	invariant.
218
219	From there, clearly FIRST(M) can include any token from SEP\_SET(M) or
220	ALPHA\_SET(M), and if the complex NT match is nonempty, then any token starting
221	FIRST(`tt ...`) could work too. The last piece to consider is ε. SEP\_SET(M) and
222	FIRST(`tt ...`) \ {ε} cannot contain ε, but ALPHA\_SET(M) could. Hence, this
223	definition allows M to accept ε if and only if ε ∈ ALPHA\_SET(M) does. This is
224	correct because for M to accept ε in the complex NT case, both the complex NT
225	and α must accept it. If OP = `+`, meaning that the complex NT cannot be empty,
226	then by definition ε ∉ ALPHA\_SET(M). Otherwise, the complex NT can accept zero
227	repetitions, and then ALPHA\_SET(M) = FOLLOW(`α`). So this definition is correct
228	with respect to \varepsilon as well.
229
230	#### LAST
231
232	LAST(M), defined by case analysis on M itself (a sequence of token-trees):
233
234	* if M is the empty sequence, then LAST(M) = { ε }
235
236	* if M is a singleton token t, then LAST(M) = { t }
237
238	* if M is the singleton complex NT repeating zero or more times, `M = $( tt
239	... ) `, or `M = $( tt ... ) SEP `
240
241	* Let sep_set = { SEP } if SEP present; otherwise sep_set = {}.
242
243	* if ε ∈ LAST(`tt ...`) then LAST(M) = LAST(`tt ...`) ∪ sep_set
244
245	* otherwise, the sequence `tt ...` must be non-empty; LAST(M) = LAST(`tt
246	...`) ∪ {ε}.
247
248	* if M is the singleton complex NT repeating one or more times, `M = $( tt ...
249	) +`, or `M = $( tt ... ) SEP +`
250
251	* Let sep_set = { SEP } if SEP present; otherwise sep_set = {}.
252
253	* if ε ∈ LAST(`tt ...`) then LAST(M) = LAST(`tt ...`) ∪ sep_set
254
255	* otherwise, the sequence `tt ...` must be non-empty; LAST(M) = LAST(`tt
256	...`)
257
258	* if M is the singleton complex NT repeating zero or one time, `M = $( tt ...)
259	?`, then LAST(M) = LAST(`tt ...`) ∪ {ε}.
260
261	* if M is a delimited token-tree sequence `OPEN tt ... CLOSE`, then LAST(M) =
262	{ `CLOSE` }.
263
264	* if M is a non-empty sequence of token-trees `tt uu ...`,
265
266	* If ε ∈ LAST(`uu ...`), then LAST(M) = LAST(`tt`) ∪ (LAST(`uu ...`) \ { ε }).
267
268	* Otherwise, the sequence `uu ...` must be non-empty; then LAST(M) =
269	LAST(`uu ...`).
270
271	### Examples of FIRST and LAST
532ac7d7 XL	272
	273	Below are some examples of FIRST and LAST.
	274	(Note in particular how the special ε element is introduced and
	275	eliminated based on the interaction between the pieces of the input.)
	276
	277	Our first example is presented in a tree structure to elaborate on how
	278	the analysis of the matcher composes. (Some of the simpler subtrees
	279	have been elided.)
	280
	281	```text
	282	INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g
	283	~~~~~~~~ ~~~~~~~ ~
	284	\| \| \|
	285	FIRST: { $d:ident } { $e:expr } { h }
	286
	287
	288	INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+
	289	~~~~~~~~~~~~~~~~~~ ~~~~~~~ ~~~
	290	\| \| \|
	291	FIRST: { $d:ident } { h, ε } { f }
	292
	293	INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g
	294	~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~ ~~~~~~~~~ ~
	295	\| \| \| \|
	296	FIRST: { $d:ident, ε } { h, ε, ; } { f } { g }
	297
	298
	299	INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g
	300	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	301	\|
	302	FIRST: { $d:ident, h, ;, f }
	303	```
	304
	305	Thus:
	306
	307	* FIRST(`$($d:ident $e:expr );* $( $(h)* );* $( f ;)+ g`) = { `$d:ident`, `h`, `;`, `f` }
	308
	309	Note however that:
	310
	311	* FIRST(`$($d:ident $e:expr );* $( $(h)* );* $($( f ;)+ g)*`) = { `$d:ident`, `h`, `;`, `f`, ε }
	312
	313	Here are similar examples but now for LAST.
	314
	315	* LAST(`$d:ident $e:expr`) = { `$e:expr` }
	316	* LAST(`$( $d:ident $e:expr );*`) = { `$e:expr`, ε }
	317	* LAST(`$( $d:ident $e:expr );* $(h)*`) = { `$e:expr`, ε, `h` }
	318	* LAST(`$( $d:ident $e:expr );* $(h)* $( f ;)+`) = { `;` }
	319	* LAST(`$( $d:ident $e:expr );* $(h)* $( f ;)+ g`) = { `g` }
	320
	321	### FOLLOW(M)
	322
	323	Finally, the definition for FOLLOW(M) is built up as follows. pat, expr, etc.
	324	represent simple nonterminals with the given fragment specifier.
	325
	326	* FOLLOW(pat) = {`=>`, `,`, `=`, `\|`, `if`, `in`}`.
	327
	328	* FOLLOW(expr) = FOLLOW(stmt) = {`=>`, `,`, `;`}`.
	329
	330	* FOLLOW(ty) = FOLLOW(path) = {`{`, `[`, `,`, `=>`, `:`, `=`, `>`, `>>`, `;`,
	331	`\|`, `as`, `where`, block nonterminals}.
	332
	333	* FOLLOW(vis) = {`,`l any keyword or identifier except a non-raw `priv`; any
	334	token that can begin a type; ident, ty, and path nonterminals}.
	335
336	* FOLLOW(t) = ANYTOKEN for any other simple token, including block, ident,
337	tt, item, lifetime, literal and meta simple nonterminals, and all terminals.
338
339	* FOLLOW(M), for any other M, is defined as the intersection, as t ranges over
340	(LAST(M) \ {ε}), of FOLLOW(t).
341
342	The tokens that can begin a type are, as of this writing, {`(`, `[`, `!`, `\*`,
343	`&`, `&&`, `?`, lifetimes, `>`, `>>`, `::`, any non-keyword identifier, `super`,
344	`self`, `Self`, `extern`, `crate`, `$crate`, `_`, `for`, `impl`, `fn`, `unsafe`,
345	`typeof`, `dyn`}, although this list may not be complete because people won't
346	always remember to update the appendix when new ones are added.
347
348	Examples of FOLLOW for complex M:
349
350	* FOLLOW(`$( $d:ident $e:expr )\*`) = FOLLOW(`$e:expr`)
351	* FOLLOW(`$( $d:ident $e:expr )\* $(;)\*`) = FOLLOW(`$e:expr`) ∩ ANYTOKEN = FOLLOW(`$e:expr`)
352	* FOLLOW(`$( $d:ident $e:expr )\* $(;)\* $( f \|)+`) = ANYTOKEN
353
354	### Examples of valid and invalid matchers
355
356	With the above specification in hand, we can present arguments for
357	why particular matchers are legal and others are not.
358
359	* `($ty:ty < foo ,)` : illegal, because FIRST(`< foo ,`) = { `<` } ⊈ FOLLOW(`ty`)
360
361	* `($ty:ty , foo <)` : legal, because FIRST(`, foo <`) = { `,` } is ⊆ FOLLOW(`ty`).
362
363	* `($pa:pat $pb:pat $ty:ty ,)` : illegal, because FIRST(`$pb:pat $ty:ty ,`) = { `$pb:pat` } ⊈ FOLLOW(`pat`), and also FIRST(`$ty:ty ,`) = { `$ty:ty` } ⊈ FOLLOW(`pat`).
364
365	* `( $($a:tt $b:tt)* ; )` : legal, because FIRST(`$b:tt`) = { `$b:tt` } is ⊆ FOLLOW(`tt`) = ANYTOKEN, as is FIRST(`;`) = { `;` }.
366
367	* `( $($t:tt),* , $(t:tt),* )` : legal, (though any attempt to actually use this macro will signal a local ambiguity error during expansion).
368
369	* `($ty:ty $(; not sep)* -)` : illegal, because FIRST(`$(; not sep)* -`) = { `;`, `-` } is not in FOLLOW(`ty`).
370
371	* `($($ty:ty)-+)` : illegal, because separator `-` is not in FOLLOW(`ty`).
372
373	* `($($e:expr)*)` : illegal, because expr NTs are not in FOLLOW(expr NT).
374
416331ca	375	[Macros by Example]: macros-by-example.md
74b04a01	376	[RFC 550]: https://github.com/rust-lang/rfcs/blob/master/text/0550-macro-future-proofing.md
532ac7d7	377	[tracking issue]: https://github.com/rust-lang/rust/issues/56575