]>
Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | <html> |
2 | <head> | |
3 | <title>The Scanner and Parsing</title> | |
4 | <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> | |
5 | <link rel="stylesheet" href="theme/style.css" type="text/css"> | |
6 | </head> | |
7 | ||
8 | <body> | |
9 | <table width="100%" border="0" background="theme/bkd2.gif" cellspacing="2"> | |
10 | <tr> | |
11 | <td width="10"> | |
12 | </td> | |
13 | <td width="85%"> | |
14 | <font size="6" face="Verdana, Arial, Helvetica, sans-serif"><b>The Scanner and Parsing</b></font> | |
15 | </td> | |
16 | <td width="112"><a href="http://spirit.sf.net"><img src="theme/spirit.gif" width="112" height="48" align="right" border="0"></a></td> | |
17 | </tr> | |
18 | </table> | |
19 | <br> | |
20 | <table border="0"> | |
21 | <tr> | |
22 | <td width="10"></td> | |
23 | <td width="30"><a href="../index.html"><img src="theme/u_arr.gif" border="0"></a></td> | |
24 | <td width="30"><a href="directives.html"><img src="theme/l_arr.gif" border="0"></a></td> | |
25 | <td width="30"><a href="grammar.html"><img src="theme/r_arr.gif" border="0"></a></td> | |
26 | </tr> | |
27 | </table> | |
28 | <p>The <b>scanner</b>'s task is to feed the sequential input data stream to the | |
29 | parser. The scanner extracts data from the input, parceling, potentially modifying | |
30 | or filtering, and then finally relegating the result to individual parser elements | |
31 | on demand until the input is exhausted. The scanner is composed of two STL conforming | |
32 | forward iterators, first and last, where first is held by reference and last, | |
33 | by value. The first iterator is held by reference to allow it to be re-positioned. | |
34 | The following diagram illustrates what's happening:</p> | |
35 | <table width="62%" border="0" align="center"> | |
36 | <tr> | |
37 | <td><img src="theme/scanner1.png"></td> | |
38 | </tr> | |
39 | </table> | |
40 | <p>The scanner manages various aspects of the parsing process through a set of | |
41 | policies. There are three sets of policies that govern:</p> | |
42 | <blockquote> | |
43 | <p><img src="theme/bullet.gif" width="12" height="12"> Iteration and filtering<br> | |
44 | <img src="theme/bullet.gif" width="12" height="12"> Recognition and matching<br> | |
45 | <img src="theme/bullet.gif" width="12" height="12"> Handling semantic actions</p> | |
46 | </blockquote> | |
47 | <p>These policies are mostly hidden from view and users generally need not know | |
48 | about them. Advanced users might however provide their own policies that override | |
49 | the ones that are already in place to fine tune the parsing process | |
50 | to fit their own needs. We shall see how this can be done. This will be covered | |
51 | in further detail later.</p> | |
52 | <p>The <tt>scanner</tt> is a template class expecting two parameters: <tt>IteratorT</tt>, | |
53 | the iterator type and <tt>PoliciesT</tt>, its set of policies. <tt>IteratorT</tt> | |
54 | defaults to <tt>char const*</tt> while <tt>PoliciesT</tt> defaults to <tt>scanner_policies<></tt>, | |
55 | a predefined set of scanner policies that we can use straight out of the box.</p> | |
56 | <pre><code><font color="#000000"><span class=keyword> template</span><span class=special>< | |
57 | </span><span class=keyword>typename </span><span class=identifier>IteratorT </span><span class=special>= </span><span class=keyword>char </span><span class=keyword>const</span><span class=special>*, | |
58 | </span><span class=keyword>typename </span><span class=identifier>PoliciesT </span><span class=special>= </span><span class=identifier>scanner_policies</span><span class=special><> </span><span class=special>> | |
59 | </span><span class=keyword>class </span><span class=identifier>scanner</span><span class=special>;</span></font></code></pre> | |
60 | <p>Spirit uses the same iterator concepts and interface formally defined by the | |
61 | C++ Standard Template Library (STL). We can use iterators supplied by STL's | |
62 | containers (e.g. <tt>list</tt>, <tt>vector</tt>, <tt>string</tt>, etc.) as is, | |
63 | or perhaps write our own. Iterators can be as simple as a pointer (e.g. <tt>char | |
64 | const<span class="operators">*</span></tt>). At the other end of the spectrum, | |
65 | iterators can be quite complex; for instance, an iterator adapter that wraps | |
66 | a lexer such as LEX.</p> | |
67 | <h2>The Free Parse Functions</h2> | |
68 | <p>The framework provides a couple of free functions to make parsing a snap. These | |
69 | parser functions have two forms. The first form works on the <b>character level</b>. | |
70 | The second works on the <b>phrase level</b> and asks for a <b>skip parser</b>.</p> | |
71 | <p>The <b>skip parser</b> is just about any parser primitive or composite. Its | |
72 | purpose is to move the scanner's <tt>first</tt> iterator to valid tokens by | |
73 | skipping white spaces. In C for instance, the tab <tt class="quotes">'\t'</tt>, | |
74 | the newline <tt class="quotes">'\n'</tt>, return <tt><span class="quotes">'\r'</span></tt>, | |
75 | space <tt class="quotes">' '</tt> and characters inside comments <tt class="quotes">/*...*/</tt> | |
76 | are considered as white spaces.</p> | |
77 | <p><b>Character level parsing</b></p> | |
78 | <pre><code><font color="#000000"><span class=special> </span><span class=keyword>template </span><span class=special><</span><span class=keyword>typename </span><span class=identifier>IteratorT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>DerivedT</span><span class=special>> | |
79 | </span><span class=identifier>parse_info</span><span class=special><</span><span class=identifier>IteratorT</span><span class=special>> | |
80 | </span><span class=identifier>parse | |
81 | </span><span class=special>( | |
82 | </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>first</span><span class=special>, | |
83 | </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>last</span><span class=special>, | |
84 | </span><span class=identifier>parser</span><span class=special><</span><span class=identifier>DerivedT</span><span class=special>> </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>p | |
85 | </span><span class=special>);</span></font></code></pre> | |
86 | <pre><code><font color="#000000"><span class=special> </span><span class=keyword>template </span><span class=special><</span><span class=keyword>typename </span><span class=identifier>CharT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>DerivedT</span><span class=special>> | |
87 | </span><span class=identifier>parse_info</span><span class=special><</span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>*> | |
88 | </span><span class=identifier>parse | |
89 | </span><span class=special>( | |
90 | </span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>* </span><span class=identifier>str</span><span class=special>, | |
91 | </span><span class=identifier>parser</span><span class=special><</span><span class=identifier>DerivedT</span><span class=special>> </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>p | |
92 | </span><span class=special>);</span></font></code></pre> | |
93 | <p>There are two variants. The first variant accepts a <tt>first</tt>, <tt>last</tt> | |
94 | iterator pair like you do STL algorithms. The second variant accepts a null | |
95 | terminated string. The last argument is a parser <tt>p</tt> which will be used | |
96 | to parse the input.</p> | |
97 | <p><b>Phrase level parsing</b></p> | |
98 | <pre><code><font color="#000000"><span class=special> </span><span class=keyword>template </span><span class=special><</span><span class=keyword>typename </span><span class=identifier>IteratorT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>ParserT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>SkipT</span><span class=special>> | |
99 | </span><span class=identifier>parse_info</span><span class=special><</span><span class=identifier>IteratorT</span><span class=special>> | |
100 | </span><span class=identifier>parse | |
101 | </span><span class=special>( | |
102 | </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>first</span><span class=special>, | |
103 | </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>last</span><span class=special>, | |
104 | </span><span class=identifier>parser</span><span class=special><</span><span class=identifier>ParserT</span><span class=special>> </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>p</span><span class=special>, | |
105 | </span><span class=identifier>parser</span><span class=special><</span><span class=identifier>SkipT</span><span class=special>> </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>skip | |
106 | </span><span class=special>);</span></font></code></pre> | |
107 | <pre><code><font color="#000000"><span class=special> </span><span class=keyword>template </span><span class=special><</span><span class=keyword>typename </span><span class=identifier>CharT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>ParserT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>SkipT</span><span class=special>> | |
108 | </span><span class=identifier>parse_info</span><span class=special><</span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>*> | |
109 | </span><span class=identifier>parse | |
110 | </span><span class=special>( | |
111 | </span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>* </span><span class=identifier>str</span><span class=special>, | |
112 | </span><span class=identifier>parser</span><span class=special><</span><span class=identifier>ParserT</span><span class=special>> </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>p</span><span class=special>, | |
113 | </span><span class=identifier>parser</span><span class=special><</span><span class=identifier>SkipT</span><span class=special>> </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>skip | |
114 | </span><span class=special>);</span></font></code></pre> | |
115 | <p>Like above, there are two variants. The first variant accepts a <tt>first</tt>, | |
116 | <tt>last</tt> iterator pair like you do STL algorithms. The second variant accepts | |
117 | a null terminated string. The argument <tt>p</tt> is the parser which will be | |
118 | used to parse the input. The last argument <tt>skip</tt> is the skip parser.</p> | |
119 | <p><b>The parse_info structure</b></p> | |
120 | <p>The functions above return a <tt>parse_info</tt> structure parameterized by | |
121 | the iterator type passed in. The parse_info struct has these members:</p> | |
122 | <table width="90%" border="0" align="center"> | |
123 | <tr> | |
124 | <td colspan="2" class="table_title"><b>parse_info</b></td> | |
125 | </tr> | |
126 | <tr> | |
127 | <td width="14%" class="table_cells"><b>stop</b></td> | |
128 | <td width="86%" class="table_cells">Points to the final parse position (i.e | |
129 | The parser recognized and processed the input up to this point)</td> | |
130 | </tr> | |
131 | <tr> | |
132 | <td width="14%" class="table_cells"><b>hit</b></td> | |
133 | <td width="86%" class="table_cells">True if parsing is successful. This may | |
134 | be full: the parser consumed all the input, or partial: the parser consumed | |
135 | only a portion of the input.</td> | |
136 | </tr> | |
137 | <tr> | |
138 | <td width="14%" class="table_cells"><b>full</b></td> | |
139 | <td width="86%" class="table_cells">True when we have a full match (i.e The | |
140 | parser consumed all the input).</td> | |
141 | </tr> | |
142 | <tr> | |
143 | <td width="14%" class="table_cells"><b>length</b></td> | |
144 | <td width="86%" class="table_cells">The number of characters consumed by the | |
145 | parser. This is valid only if we have a successful match (either partial | |
146 | or full). </td> | |
147 | </tr> | |
148 | </table> | |
149 | <h2><a name="phrase_scanner_t" id="phrase_scanner_t"></a><img src="theme/lens.gif" width="15" height="16"> | |
150 | The phrase_scanner_t and wide_phrase_scanner_t</h2> | |
151 | <p>For convenience, Spirit declares these typedefs:</p> | |
152 | <pre> | |
153 | <span class="keyword">typedef</span> scanner<span class="special"><</span><span class="keyword">char const</span><span class="special">*,</span> unspecified<span class="special">></span> phrase_scanner_t<span class="special">;</span> | |
154 | <span class="keyword">typedef</span> scanner<span class="special"><</span><span class="keyword">wchar_t const</span><span class="special">*,</span> <span class="identifier">unspecified</span><span class="special">></span> wide_phrase_scanner_t<span class="special">;</span> | |
155 | </pre> | |
156 | <p>These are the exact scanner types used by Spirit on calls to the parse function | |
157 | passing in a <tt>char const*</tt> (C string) or a <tt>wchar_t const*</tt> (wide | |
158 | string) as the first parameter and a <tt>space_p</tt> as skip-parser (the third | |
159 | parameter). For instance, we can use these typedefs to declare some rules. Example:</p> | |
160 | <pre> rule<span class="special"><</span>phrase_scanner_t<span class="special">> </span><span class="identifier">my_rule</span><span class="special">; | |
161 | </span><span class="identifier">parse</span><span class="special">(</span><span class="string">"abrakadabra"</span><span class="special">, </span><span class="identifier">my_rule</span><span class="special">,</span> <span class="identifier">space_p</span><span class="special">);</span></pre> | |
162 | <h2><img src="theme/lens.gif" width="15" height="16"> Direct parsing with Iterators</h2> | |
163 | <p>The free parse functions make it easy for us. By using them, we need not bother | |
164 | with the scanner intricacies. The free parse functions hide the dirty details. | |
165 | However, sometime in the future, we will need to get under the hood. It's nice | |
166 | that we know what we are dealing with when that need comes. We will need to | |
167 | go low-level and call the parser's parse member function directly. </p> | |
168 | <p>If we wish to work on the <b>character level</b>, the procedure is quite simple:</p> | |
169 | <pre><span class=identifier> </span><span class=identifier>scanner</span><span class=special><</span><span class=identifier>IteratorT</span><span class=special>> </span><span class=identifier>scan</span><span class=special>(</span><span class=identifier>first</span><span class=special>, </span><span class=identifier>last</span><span class=special>); | |
170 | ||
171 | </span><span class=keyword>if </span><span class=special>(</span><span class=identifier>p</span><span class=special>.</span><span class=identifier>parse</span><span class=special>(</span><span class=identifier>scan</span><span class=special>)) | |
172 | </span><span class=special>{ | |
173 | </span><span class=comment>// Parsed successfully. If first == last, then we have | |
174 | // a full parse, the parser recognized the input in whole. | |
175 | </span><span class=special>} | |
176 | </span><span class=keyword>else | |
177 | </span><span class=special>{ | |
178 | </span><span class=comment>// Parsing failure. The parser failed to recognize the input | |
179 | </span><span class=special>}</span></pre> | |
180 | <table width="80%" border="0" align="center"> | |
181 | <tr> | |
182 | <td class="note_box"><img src="theme/alert.gif" width="16" height="16"> <strong>The | |
183 | scanner position on an unsucessful match</strong><br> <br> | |
184 | On a successful match, the input is advanced accordingly. But what happens | |
185 | on an unsuccessful match? Be warned. It might be intuitive to think that | |
186 | the scanner position is reset to its initial position prior to parsing. | |
187 | No, the position is not reset. On an unsuccessful match, the position of | |
188 | the scanner is <strong>undefined</strong>! Usually, it is positioned at | |
189 | the farthest point where the error was found somewhere down the recursive | |
190 | descent. If this behavior is not desired, you may need to position the scanner | |
191 | yourself. The <a href="numerics.html#scanner_save">example in the numerics | |
192 | chapter</a> illustrates how the scanner position can be saved and later | |
193 | restored.</td> | |
194 | </tr> | |
195 | </table> | |
196 | <p>Where <tt>p</tt> is the parser we want to use, and <tt>first</tt>/<tt>last</tt> | |
197 | are the iterator pairs referring to the input. We just create a scanner given | |
198 | the iterators. The scanner type we will use here uses the default <tt>scanner_policies<></tt>.</p> | |
199 | <p>The situation is a bit more complex when we wish to work on the <b>phrase level</b>:</p> | |
200 | <pre><span class=special> </span><span class=keyword>typedef </span><span class=identifier>skip_parser_iteration_policy</span><span class=special><</span><span class=identifier>SkipT</span><span class=special>> </span><span class=identifier>iter_policy_t</span><span class=special>; | |
201 | </span><span class=keyword>typedef </span><span class=identifier>scanner_policies</span><span class=special><</span><span class=identifier>iter_policy_t</span><span class=special>> </span><span class=identifier>scanner_policies_t</span><span class=special>; | |
202 | </span><span class=keyword>typedef </span><span class=identifier>scanner</span><span class=special><</span><span class=identifier>IteratorT</span><span class=special>, </span><span class=identifier>scanner_policies_t</span><span class=special>> </span><span class=identifier>scanner_t</span><span class=special>; | |
203 | ||
204 | </span><span class=special> </span><span class=identifier>iter_policy_t </span><span class=identifier>iter_policy</span><span class=special>(</span><span class=identifier>skip</span><span class=special>); | |
205 | </span><span class=identifier>scanner_policies_t </span><span class=identifier>policies</span><span class=special>(</span><span class=identifier>iter_policy</span><span class=special>); | |
206 | </span><span class=identifier>scanner_t </span><span class=identifier>scan</span><span class=special>(</span><span class=identifier>first</span><span class=special>, </span><span class=identifier>last</span><span class=special>, </span><span class=identifier>policies</span><span class=special>); | |
207 | </span> | |
208 | <span class=keyword>if </span><span class=special>(</span><span class=identifier>p</span><span class=special>.</span><span class=identifier>parse</span><span class=special>(</span><span class=identifier>scan</span><span class=special>)) | |
209 | </span><span class=special>{ | |
210 | </span><span class=comment>// Parsed successfully. If first == last, then we have | |
211 | // a full parse, the parser recognized the input in whole. | |
212 | </span><span class=special>} | |
213 | </span><span class=keyword>else | |
214 | </span><span class=special>{ | |
215 | </span><span class=comment>// Parsing failure. The parser failed to recognize the input | |
216 | </span><span class=special>}</span></pre> | |
217 | <p>Where <tt>SkipT</tt> is the type of the skip-parser, <tt>skip</tt>. Again, | |
218 | <tt>p</tt> is the parser we want to use, and <tt>first</tt>/<tt>last</tt> are | |
219 | the iterator pairs referring to the input. Given a skip-parser type <tt>SkipT</tt>, | |
220 | <span class=identifier><tt>skip_parser_iteration_policy</tt></span> creates | |
221 | a scanner iteration policy that skips over portions that are recognized by the | |
222 | skip-parser. This may then be used to create a scanner. The <tt>scanner_policies</tt> | |
223 | class wraps all scanner related policies including the iteration policies.</p> | |
224 | <h2><a name="lexeme_scanner"></a>lexeme_scanner</h2> | |
225 | <p>When switching from phrase level to character level parsing, the <tt>lexeme_d</tt> | |
226 | (see <a href="directives.html">directives.html</a>) does its magic by disabling | |
227 | the skipping of white spaces. This is done by tweaking the <a href="scanner.html">scanner</a>. | |
228 | However, when we do this, all parsers inside the lexeme gets a transformed scanner | |
229 | type. This should not be a problem in most cases. However, when rules are called | |
230 | inside the <tt>lexeme_d</tt>, the compiler will choke if the rule does not have | |
231 | the proper scanner type. If a rule must be used inside a <tt>lexeme_d</tt>, | |
232 | the rule's type must be:</p> | |
233 | <pre> <span class=identifier>rule</span><span class=special><</span><span class=identifier>lexeme_scanner</span><span class="special"><</span><span class=identifier>ScannerT</span><span class=special>>::</span><span class="identifier">type</span><span class=special>> </span>r<span class=special>;</span></pre> | |
234 | <p>where <span class=identifier><tt>ScannerT</tt></span> is the actual type of | |
235 | the scanner used. Take note that <tt>lexeme_scanner</tt> will only work for phrase level scanners. </p> | |
236 | <h2><a name="as_lower_scanner"></a>as_lower_scanner</h2> | |
237 | <p>Similarly, the <tt>as_lower_d</tt> does its work by filtering and converting | |
238 | all characters received from the scanner to lower case. This is also done by | |
239 | tweaking the <a href="scanner.html">scanner</a>. Then again, all parsers inside | |
240 | the <tt>as_lower_d</tt> gets a transformed scanner type. If a rule must be used | |
241 | inside a <tt>as_lower_d</tt>, the rule's type must be:</p> | |
242 | <pre> <span class=identifier>rule</span><span class=special><</span><span class=identifier>as_lower_scanner</span><span class="special"><</span><span class=identifier>ScannerT</span><span class=special>>::</span><span class="identifier">type</span><span class=special>> </span>r<span class=special>;</span></pre> | |
243 | <p>where <span class=identifier><tt>ScannerT</tt></span> is the actual type of | |
244 | the scanner used. </p> | |
245 | <table width="80%" border="0" align="center"> | |
246 | <tr> | |
247 | <td class="note_box"><img src="theme/bulb.gif" width="13" height="18"> See | |
248 | the techniques section for an <a href="techniques.html#multiple_scanner_support">example</a> | |
249 | of a <a href="grammar.html">grammar</a> using a <a href="rule.html#multiple_scanner_support">multiple | |
250 | scanner enabled rule</a>, <a href="scanner.html#lexeme_scanner">lexeme_scanner</a> | |
251 | and <a href="scanner.html#as_lower_scanner">as_lower_scanner.</a></td> | |
252 | </tr> | |
253 | </table> | |
254 | <h3><a name="no_actions_scanner"></a>no_actions_scanner</h3> | |
255 | <p>Again, <tt>no_actions_d</tt> directive tweaks the scanner to disable firing | |
256 | semantic actions. Like before, all parsers inside the <tt>no_actions_d</tt> | |
257 | gets a transformed scanner type. If a rule must be used inside a <tt>no_actions_d</tt>, | |
258 | the rule's type must be:</p> | |
259 | <pre> <span class=identifier>rule</span><span class=special><</span>no_actions_scanner<span class="special"><</span><span class=identifier>ScannerT</span><span class=special>>::</span><span class="identifier">type</span><span class=special>> </span>r<span class=special>;</span></pre> | |
260 | <p>where <tt>ScannerT</tt> is the actual type of the scanner used. <span class=special></span></p> | |
261 | <table width="80%" border="0" align="center"> | |
262 | <tr> | |
263 | <td class="note_box"><img src="theme/note.gif" width="16" height="16"> Be | |
264 | sure to add "<tt>typename</tt>" before <tt><span class=identifier><tt>lexeme_scanner</tt>, | |
265 | <tt>as_lower_scanner</tt></span></tt> and <tt>no_actions_scanner</tt> when | |
266 | these are used inside a template class or function.</td> | |
267 | </tr> | |
268 | </table> | |
269 | <p><img src="theme/lens.gif" width="15" height="16"> See <a href="../example/fundamental/no_actions.cpp">no_actions.cpp</a>. This is part of the Spirit distribution.</p> | |
270 | <table border="0"> | |
271 | <tr> | |
272 | <td width="10"></td> | |
273 | <td width="30"><a href="../index.html"><img src="theme/u_arr.gif" border="0"></a></td> | |
274 | <td width="30"><a href="directives.html"><img src="theme/l_arr.gif" border="0"></a></td> | |
275 | <td width="30"><a href="grammar.html"><img src="theme/r_arr.gif" border="0"></a></td> | |
276 | </tr> | |
277 | </table> | |
278 | <br> | |
279 | <hr size="1"> | |
280 | <p class="copyright">Copyright © 1998-2003 Joel de Guzman<br> | |
281 | <br> | |
282 | <font size="2">Use, modification and distribution is subject to the Boost Software | |
283 | License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at | |
284 | http://www.boost.org/LICENSE_1_0.txt)</font></p> | |
285 | <p> </p> | |
286 | <p> </p> | |
287 | </body> | |
288 | </html> |