]>
Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
2 | <html> | |
3 | <head> | |
4 | <meta content= | |
5 | "HTML Tidy for Windows (vers 1st February 2003), see www.w3.org" | |
6 | name="generator"> | |
7 | <title> | |
8 | Basic Concepts | |
9 | </title> | |
10 | <meta http-equiv="Content-Type" content="text/html; charset=us-ascii"> | |
11 | <link rel="stylesheet" href="theme/style.css" type="text/css"> | |
12 | </head> | |
13 | <body> | |
14 | <table width="100%" border="0" background="theme/bkd2.gif" cellspacing="2"> | |
15 | <tr> | |
16 | <td width="10"></td> | |
17 | <td width="85%"> | |
18 | <font size="6" face="Verdana, Arial, Helvetica, sans-serif"><b>Basic | |
19 | Concepts</b></font> | |
20 | </td> | |
21 | <td width="112"> | |
22 | <a href="http://spirit.sf.net"><img src="theme/spirit.gif" | |
23 | width="112" height="48" align="right" border="0"></a> | |
24 | </td> | |
25 | </tr> | |
26 | </table><br> | |
27 | <table border="0"> | |
28 | <tr> | |
29 | <td width="10"></td> | |
30 | <td width="30"> | |
31 | <a href="../index.html"><img src="theme/u_arr.gif" border="0"></a> | |
32 | </td> | |
33 | <td width="30"> | |
34 | <a href="quick_start.html"><img src="theme/l_arr.gif" border="0"></a> | |
35 | </td> | |
36 | <td width="30"> | |
37 | <a href="organization.html"><img src="theme/r_arr.gif" border="0"> | |
38 | </a> | |
39 | </td> | |
40 | </tr> | |
41 | </table> | |
42 | <p> | |
43 | There are a few fundamental concepts that need to be understood well: 1) | |
44 | The <strong>Parser</strong>, 2) <strong>Match</strong>, 3) The | |
45 | <strong>Scanner</strong>, and 4) <strong>Semantic Actions</strong>. These | |
46 | basic concepts interact with one another, and the functionalities of each | |
47 | interweave throughout the framework to make it one coherent whole. | |
48 | </p> | |
49 | <table width="48%" border="0" align="center"> | |
50 | <tr> | |
51 | <td height="211"> | |
52 | <img src="theme/intro1.png"> | |
53 | </td> | |
54 | </tr> | |
55 | </table> | |
56 | <h2> | |
57 | The Parser | |
58 | </h2> | |
59 | <p> | |
60 | Central to the framework is the parser. The parser does the actual work | |
61 | of recognizing a linear input stream of data read sequentially from start | |
62 | to end by the scanner. The parser attempts to match the input following a | |
63 | well-defined set of specifications known as grammar rules. The parser | |
64 | reports the success or failure to its client through a match object. When | |
65 | successful, the parser calls a client-supplied semantic action. Finally, | |
66 | the semantic action extracts structural information depending on the data | |
67 | passed by the parser and the hierarchical context of the parser it is | |
68 | attached to. | |
69 | </p> | |
70 | <p> | |
71 | Parsers come in different flavors. The Spirit framework comes bundled | |
72 | with an extensive set of pre-defined parsers that perform various parsing | |
73 | tasks from the trivial to the complex. The parser, as a concept, has a | |
74 | public conceptual interface contract. Following the contract, anyone can | |
75 | write a conforming parser that will play along well with the framework's | |
76 | predefined components. We shall provide a blueprint detailing the | |
77 | conceptual interface of the parser later. | |
78 | </p> | |
79 | <p> | |
80 | Clients of the framework generally do not need to write their own | |
81 | hand-coded parsers at all. Spirit has an immense repertoire of | |
82 | pre-defined parsers covering all aspects of syntax and semantic analysis. | |
83 | We shall examine this repertoire of parsers in the following sections. In | |
84 | the rare case where a specific functionality is not available, it is | |
85 | extremely easy to write a user-defined parser. The ease in writing a | |
86 | parser entity is the main reason for Spirit's extensibility. | |
87 | </p> | |
88 | <h2> | |
89 | Primitives and Composites | |
90 | </h2> | |
91 | <p> | |
92 | Spirit parsers fall into two categories: <b>primitives</b> and | |
93 | <b>composites</b>. These two categories are more or less synonymous to | |
94 | terminals and non-terminals in parsing lingo. Primitives are | |
95 | non-decomposable atomic units. Composites on the other hand are parsers | |
96 | that are composed of other parsers which can in turn be a primitive or | |
97 | another composite. To illustrate, consider the Spirit expression: | |
98 | </p> | |
99 | ||
100 | <pre><code><font color="#000000"> </font></code><code><span class="identifier">real_p</span> <span class= | |
101 | "special">>></span> <span class="special">*(</span><span class= | |
102 | "literal">','</span> <span class="special">>></span> <span class= | |
103 | "identifier">real_p</span><span class="special">)</span></code> | |
104 | </pre> | |
105 | <p> | |
106 | <tt><tt>real_p</tt></tt> is a primitive parser that can parse real | |
107 | numbers. The quoted comma <tt class="quotes">','</tt> in the expression | |
108 | is a shortcut and is equivalent to <tt>ch_p<span class= | |
109 | "operators">(</span><span class="quotes">','</span><span class= | |
110 | "operators">)</span></tt>, which is another primitive parser that | |
111 | recognizes single characters. | |
112 | </p> | |
113 | <p> | |
114 | The expression above corresponds to the following parse tree: | |
115 | </p> | |
116 | <table width="29%" border="0" align="center"> | |
117 | <tr> | |
118 | <td> | |
119 | <img src="theme/intro7.png"> | |
120 | </td> | |
121 | </tr> | |
122 | </table> | |
123 | <p> | |
124 | The expression: | |
125 | </p> | |
126 | ||
127 | <pre><code><font color="#000000"> </font></code><span class= | |
128 | "literal">','</span> <span class="special">>></span> <span class= | |
129 | "identifier">real_p</span> | |
130 | </pre> | |
131 | <p> | |
132 | composes a <b>sequence</b> parser. The <tt>sequence</tt> parser is a | |
133 | composite parser comprising two parsers: the one on its left hand side | |
134 | (lhs), <tt>ch_p<span class="operators">(</span><span class= | |
135 | "quotes">','</span><span class="operators">)</span></tt> ; and the other | |
136 | on its right hand side (rhs), <tt>real_p</tt>. This composite parser, | |
137 | when called, calls its lhs and rhs in sequence and reports a successful | |
138 | match only if both are successful. | |
139 | </p> | |
140 | <table width="14%" border="0" align="center"> | |
141 | <tr> | |
142 | <td> | |
143 | <img src="theme/intro2.png"> | |
144 | </td> | |
145 | </tr> | |
146 | </table> | |
147 | <p> | |
148 | The <tt>sequence</tt> parser is a binary composite. It is composed of two | |
149 | parsers. There are unary composites as well. Unary composites hold only a | |
150 | single subject. Like the binary composite, the unary composite may change | |
151 | the behavior of its embedded subject. One particular example is the | |
152 | <b>Kleene star</b>. The Kleene star, when called to parse, calls its sole | |
153 | subject zero or more times. "Zero or more" implies that the Kleene star | |
154 | always returns a successful match, possibly matching the null string: "". | |
155 | </p> | |
156 | <p> | |
157 | The expression: | |
158 | </p> | |
159 | ||
160 | <pre><code><font color="#000000"> </font></code><code><span class= | |
161 | "special">*(</span><span class="literal">','</span> <span class= | |
162 | "special">>></span> <span class= | |
163 | "identifier">real_p</span><span class="special">)</span></code> | |
164 | </pre> | |
165 | <p> | |
166 | wraps the whole sequence composite above inside a <tt>kleene_star</tt>. | |
167 | </p> | |
168 | <table width="17%" border="0" align="center"> | |
169 | <tr> | |
170 | <td> | |
171 | <img src="theme/intro3.png"> | |
172 | </td> | |
173 | </tr> | |
174 | </table> | |
175 | <p> | |
176 | Finally, the full expression composes a <tt>real_p</tt> primitive parser | |
177 | and the <tt>kleene_star</tt> we have above into another higher level | |
178 | <tt>sequence</tt> parser composite. | |
179 | </p> | |
180 | <table width="34%" border="0" align="center"> | |
181 | <tr> | |
182 | <td> | |
183 | <img src="theme/intro4.png"> | |
184 | </td> | |
185 | </tr> | |
186 | </table> | |
187 | <p> | |
188 | A few simple classes, when composed and structured in a hierarchy, form a | |
189 | very powerful object-oriented recursive-descent parsing engine. These | |
190 | classes provide the infrastructure needed for the construction of | |
191 | more-complex parsers. The final parser composite is a non-deterministic | |
192 | recursive-descent parser with infinite look-ahead. | |
193 | </p> | |
194 | <p> | |
195 | Top-down descent traverses the hierarchy. The outer <tt>sequence</tt> | |
196 | calls the leftmost <tt>real_p</tt> parser. If successful, the | |
197 | <tt>kleene_star</tt> is called next. The <tt>kleene_star</tt> calls the | |
198 | inner <tt>sequence</tt> repeatedly in a loop until it fails to match, or | |
199 | the input is exhausted. Inside, <tt>ch_p(',')</tt> and then | |
200 | <tt>real_p</tt> are called in sequence. The following diagram illustrates | |
201 | what is happening, somewhat reminiscent of Pascal syntax diagrams. | |
202 | </p> | |
203 | <table width="37%" border="0" align="center"> | |
204 | <tr> | |
205 | <td> | |
206 | <img src="theme/intro5.png"> | |
207 | </td> | |
208 | </tr> | |
209 | </table> | |
210 | <p> | |
211 | The flexibility of object embedding and composition combined with | |
212 | recursion opens up a unique approach to parsing. Subclasses are free to | |
213 | form aggregates and algorithms of arbitrary complexity. Complex parsers | |
214 | can be created with the composition of only a few primitive classes. | |
215 | </p> | |
216 | <p> | |
217 | The framework is designed to be fully open-ended and extensible. New | |
218 | primitives or composites, from the trivial to the complex, may be added | |
219 | any time. Composition happens (statically) at compile time. This is | |
220 | possible through the expressive flexibility of C++ expression templates | |
221 | and template meta-programming. | |
222 | </p> | |
223 | <p> | |
224 | The result is a composite composed of primitives and smaller composites. | |
225 | This embedding strategy gives us the ability to build hierarchical | |
226 | structures that fully model EBNF expressions of arbitrary complexity. | |
227 | Later on, we shall see more primitive and composite building blocks. | |
228 | </p> | |
229 | <h2> | |
230 | The Scanner | |
231 | </h2> | |
232 | <p> | |
233 | Like the parser, the scanner is also an abstract concept. The task of the | |
234 | scanner is to feed the sequential input data stream to the parser. The | |
235 | scanner is composed of two STL conforming forward iterators, first and | |
236 | last, where first is held by reference and last, by value. The first | |
237 | iterator is held by reference to allow re-positioning by the parser. A | |
238 | set of policies governs how the scanner behaves. Parsers extract data | |
239 | from the scanner and position the iterator appropriately through its | |
240 | member functions. | |
241 | </p> | |
242 | <p> | |
243 | Knowledge of the intricacies of these policies is not required at all in | |
244 | most cases. However, knowledge of the scanner's basic API is required to | |
245 | write fully-conforming Spirit parsers. The scanner's API will be outlined | |
246 | in a separate section. In addition, for the power users and the | |
247 | adventurous among us, a full section will be devoted to covering the | |
248 | scanner policies. The scanner policies make Spirit very flexible and | |
249 | extensible. For instance, some of the policies may be modified to filter | |
250 | data. A practical example is a scanner policy that does not distinguish | |
251 | upper and lower case whereby making it useful for parsing case | |
252 | insensitive input. Another example is a scanner policy that strips white | |
253 | spaces from the input. | |
254 | </p> | |
255 | <h2> | |
256 | The Match | |
257 | </h2> | |
258 | <p> | |
259 | The parser has a conceptual parse member function taking in a scanner and | |
260 | returning a match object. The primary function of the match object is to | |
261 | report parsing success (or failure) back to the parser's caller; i.e., it | |
262 | evaluates to true if the parse function is successful, false otherwise. | |
263 | If the parse is successful, the match object may also be queried to | |
264 | report the number of characters matched (using <tt>match.length()</tt>). | |
265 | The length is non-negative if the match is successful, and the typical | |
266 | length of a parse failure is -1. A zero length is perfectly valid and | |
267 | still represents a successful match. | |
268 | </p> | |
269 | <p> | |
270 | Parsers may have attribute data associated with it. For example, the | |
271 | real_p parser has a numeric datum associated with it. This attribute is | |
272 | the parsed number. This attribute is passed on to the returned match | |
273 | object. The match object may be queried to get this attribute. This datum | |
274 | is valid only when the match is successful. | |
275 | </p> | |
276 | <h2> | |
277 | Semantic Actions | |
278 | </h2> | |
279 | <p> | |
280 | A composite parser forms a hierarchy. Parsing proceeds from the topmost | |
281 | parent parser which delegates and apportions the parsing task to its | |
282 | children recursively to its children's children and so on until a | |
283 | primitive is reached. By attaching semantic actions to various points in | |
284 | this hierarchy, in effect we can transform the flat linear input stream | |
285 | into a structured representation. This is essentially what parsers do. | |
286 | </p> | |
287 | <p> | |
288 | Recall our example above: | |
289 | </p> | |
290 | ||
291 | <pre><code><font color="#000000"> </font></code><code><span class= | |
292 | "identifier">real_p</span> <span class= | |
293 | "special">>></span> <span class="special">*(</span><span class= | |
294 | "literal">','</span> <span class="special">>></span> <span class= | |
295 | "identifier">real_p</span><span class="special">)</span></code> | |
296 | </pre> | |
297 | <p> | |
298 | By hooking a function (or functor) into the real_p parsers, we can | |
299 | extract the numbers from the input: | |
300 | </p> | |
301 | ||
302 | <pre><code><font color="#000000"> </font></code><span class= | |
303 | "identifier">real_p</span><span class="special">[&</span><span class= | |
304 | "identifier">f</span><span class="special">]</span> <span class= | |
305 | "special">>></span> <span class="special">*(</span><span class= | |
306 | "literal">','</span> <span class="special">>></span> <span class= | |
307 | "identifier">real_p</span><span class="special">[&</span><span class= | |
308 | "identifier">f</span><span class="special">])</span> | |
309 | </pre> | |
310 | <table width="41%" border="0" align="center"> | |
311 | <tr> | |
312 | <td> | |
313 | <img src="theme/intro6.png"> | |
314 | </td> | |
315 | </tr> | |
316 | </table> | |
317 | ||
318 | <p> where <tt>f</tt> is a function that takes in a single argument. The <tt><span class="operators">[&</span>f<span class= | |
319 | "operators">]</span></tt> hooks the parser with the function such that when | |
320 | <tt>real_p</tt> recognizes a valid number, the function <tt>f</tt> is called. | |
321 | It is up to the function then to do what is appropriate. For example, it can | |
322 | stuff the numbers in a vector. Or perhaps, if the grammar is changed slightly | |
323 | by replacing <tt class="quotes">','</tt> with <tt class="quotes">'+'</tt>, then | |
324 | we have a primitive calculator that computes sums. The function <tt>f</tt> then | |
325 | can then be made to add all incoming numbers.<br> | |
326 | </p> | |
327 | <table border="0"> | |
328 | <tr> | |
329 | <td width="10"></td> | |
330 | <td width="30"> | |
331 | <a href="../index.html"><img src="theme/u_arr.gif" border="0"></a> | |
332 | </td> | |
333 | <td width="30"> | |
334 | <a href="quick_start.html"><img src="theme/l_arr.gif" border="0"></a> | |
335 | </td> | |
336 | <td width="30"> | |
337 | <a href="organization.html"><img src="theme/r_arr.gif" border="0"> | |
338 | </a> | |
339 | </td> | |
340 | </tr> | |
341 | </table><br> | |
342 | <hr size="1"> | |
343 | <p class="copyright"> | |
344 | Copyright © 1998-2003 Joel de Guzman<br> | |
345 | <br> | |
346 | <font size="2">Use, modification and distribution is subject to the Boost | |
347 | Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or | |
348 | copy at http://www.boost.org/LICENSE_1_0.txt)</font> | |
349 | </p> | |
350 | <p> | |
351 | ||
352 | </p> | |
353 | </body> | |
354 | </html> |