]>
Commit | Line | Data |
---|---|---|
1 | [/============================================================================== | |
2 | Copyright (C) 2001-2015 Joel de Guzman | |
3 | Copyright (C) 2001-2011 Hartmut Kaiser | |
4 | ||
5 | Distributed under the Boost Software License, Version 1.0. (See accompanying | |
6 | file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) | |
7 | ===============================================================================/] | |
8 | [section Parsing Expression Grammar] | |
9 | ||
10 | Parsing Expression Grammars (PEG) [footnote Bryan Ford: Parsing Expression | |
11 | Grammars: A Recognition-Based Syntactic Foundation, | |
12 | [@http://pdos.csail.mit.edu/~baford/packrat/popl04/]] are a derivative of | |
13 | Extended Backus-Naur Form (EBNF) [footnote Richard E. Pattis: EBNF: A Notation | |
14 | to Describe Syntax, [@http://www.cs.cmu.edu/~pattis/misc/ebnf.pdf]] | |
15 | with a different interpretation, designed to represent a recursive descent | |
16 | parser. A PEG can be directly represented as a recursive-descent parser. | |
17 | ||
18 | Like EBNF, PEG is a formal grammar for describing a formal language in | |
19 | terms of a set of rules used to recognize strings of this language. | |
20 | Unlike EBNF, PEGs have an exact interpretation. There is only one valid | |
21 | parse tree (see __ast__) for each PEG grammar. | |
22 | ||
23 | [heading Sequences] | |
24 | ||
25 | Sequences are represented by juxtaposition like in EBNF: | |
26 | ||
27 | a b | |
28 | ||
29 | The PEG expression above states that, in order for this to succeed, | |
30 | `b` must follow `a`. Here's the syntax diagram: | |
31 | ||
32 | [:__sd_sequence__] | |
33 | ||
34 | Here's a trivial example: | |
35 | ||
36 | 'x' digit | |
37 | ||
38 | which means the character `x` must be followed by a digit. | |
39 | ||
40 | [note In __x3__, we use the `>>` for sequences since C++ does not | |
41 | allow juxtaposition.] | |
42 | ||
43 | [heading Alternatives] | |
44 | ||
45 | Alternatives are represented in PEG using the slash: | |
46 | ||
47 | a / b | |
48 | ||
49 | [note In __x3__, we use the `|` for alternatives just as in EBNF.] | |
50 | ||
51 | Alternatives allow for choices. The expression above reads: try to match | |
52 | `a`. If `a` succeeds, success, if not try to match `b`. This is a bit of | |
53 | a deviation from the usual EBNF interpretation where you simply match | |
54 | `a` *or* `b`. Here's the syntax diagram: | |
55 | ||
56 | [:__sd_choice__] | |
57 | ||
58 | PEGs allow for ambiguity in the alternatives. In the expression above, | |
59 | both `a` or `b` can both match an input string. However, only the first | |
60 | matching alternative is valid. As noted, there can only be one valid | |
61 | parse tree. [/FIXME: $$$ explain more about this $$$] | |
62 | ||
63 | [heading Loops] | |
64 | ||
65 | Again, like EBNF, PEG uses the regular-expression Kleene star and the | |
66 | plus loops: | |
67 | ||
68 | a* | |
69 | a+ | |
70 | ||
71 | [note __x3__ uses the prefix star and plus since there is no postfix star or | |
72 | plus in C++.] | |
73 | ||
74 | Here are the syntax diagrams: | |
75 | ||
76 | [:__sd_kleene__] | |
77 | [:__sd_plus__] | |
78 | ||
79 | The first, called the Kleene star, matches zero or more of its subject | |
80 | `a`. The second, plus, matches one ore more of its subject `a`. | |
81 | ||
82 | Unlike EBNF, PEGs have greedy loops. It will match as much as it can | |
83 | until its subject fails to match without regard to what follows. The following | |
84 | is a classic example of a fairly common EBNF/regex expression failing to match | |
85 | in PEG: | |
86 | ||
87 | alnum* digit | |
88 | ||
89 | In PEG, alnum will eat as much alpha-numeric characters as it can | |
90 | leaving nothing more left behind. Thus, the trailing digit will get | |
91 | nothing. Loops are simply implemented in recursive descent code as for/while | |
92 | loops making them extremely efficient. That is a definite advantage. On the | |
93 | other hand, those who are familiar with EBNF and regex behavior might find the | |
94 | behavior a major gotcha. PEG provides a couple of other mechanisms to circumvent | |
95 | this. We will see more of these other mechanisms shortly. | |
96 | ||
97 | [heading Difference] | |
98 | ||
99 | In some cases, you may want to restrict a certain expression. You can think of a | |
100 | PEG expression as a match for a potentially infinite set of strings. The | |
101 | difference operator allows you to restrict this set: | |
102 | ||
103 | a - b | |
104 | ||
105 | The expression reads: match `a` but not `b`. | |
106 | ||
107 | [endsect] |