]>
Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | [/============================================================================== |
2 | Copyright (C) 2001-2011 Joel de Guzman | |
3 | Copyright (C) 2001-2011 Hartmut Kaiser | |
4 | ||
5 | Distributed under the Boost Software License, Version 1.0. (See accompanying | |
6 | file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) | |
7 | ===============================================================================/] | |
8 | ||
9 | [section Introduction] | |
10 | ||
11 | Boost Spirit is an object-oriented, recursive-descent parser and | |
12 | output generation library for C++. It allows you to write grammars and | |
13 | format descriptions using a format similar to Extended Backus Naur | |
14 | Form (EBNF)[footnote [@http://www.cl.cam.ac.uk/%7Emgk25/iso-14977.pdf | |
15 | ISO-EBNF]] directly in C++. These inline grammar | |
16 | specifications can mix freely with other C++ code and, thanks to the | |
17 | generative power of C++ templates, are immediately executable. In | |
18 | retrospect, conventional compiler-compilers or parser-generators have | |
19 | to perform an additional translation step from the source EBNF code to | |
20 | C or C++ code. | |
21 | ||
22 | The syntax and semantics of the libraries' API directly form domain-specific | |
23 | embedded languages (DSEL). In fact, Spirit exposes 3 different DSELs to the | |
24 | user: | |
25 | ||
26 | * one for creating parser grammars, | |
27 | * one for the specification of the required tokens to be used for parsing, | |
28 | * and one for the description of the required output formats. | |
29 | ||
30 | Since the target input grammars and output formats are written entirely in C++ | |
31 | we do not need any separate tools to compile, preprocess or integrate those | |
32 | into the build process. __spirit__ allows seamless integration of the parsing | |
33 | and output generation process with other C++ code. This often allows for | |
34 | simpler and more efficient code. | |
35 | ||
36 | Both the created parsers and generators are fully attributed, which allows you | |
37 | to easily build and handle hierarchical data structures in memory. These data | |
38 | structures resemble the structure of the input data and can directly be used | |
39 | to generate arbitrarily-formatted output. | |
40 | ||
41 | The [link spirit.spiritstructure figure] below depicts the overall structure | |
42 | of the Boost Spirit library. The library consists of 4 major parts: | |
43 | ||
44 | * __classic__: This is the almost-unchanged code base taken from the | |
45 | former Boost Spirit V1.8 distribution. It has been moved into the namespace | |
46 | boost::spirit::classic. A special compatibility layer has been added to | |
47 | ensure complete compatibility with existing code using Spirit V1.8. | |
48 | * __qi__: This is the parser library allowing you to build recursive | |
49 | descent parsers. The exposed domain-specific language can be used to describe | |
50 | the grammars to implement, and the rules for storing the parsed information. | |
51 | * __lex__: This is the library usable to create tokenizers (lexers). The | |
52 | domain-specific language exposed by __lex__ allows you to define regular | |
53 | expressions used to match tokens (create token definitions), associate these | |
54 | regular expressions with code to be executed whenever they are matched, and | |
55 | to add the token definitions to the lexical analyzer. | |
56 | * __karma__: This is the generator library allowing you to create code for | |
57 | recursive descent, data type-driven output formatting. The exposed | |
58 | domain-specific language is almost equivalent to the parser description language | |
59 | used in __qi__, except that it is used to describe the required output | |
60 | format to generate from a given data structure. | |
61 | ||
62 | [fig spiritstructure.png..The overall structure of the Boost Spirit library..spirit.spiritstructure] | |
63 | ||
64 | ||
65 | The three components, __qi__, __karma__ and __lex__, are designed to be used | |
66 | either stand alone, or together. The general methodology is to use the token | |
67 | sequence generated by __lex__ as the input for a parser generated by __qi__. | |
68 | On the opposite side of the equation, the hierarchical data structures generated | |
69 | by __qi__ are used for the output generators created using __karma__. | |
70 | However, there is nothing to stop you from using any of these components all | |
71 | by themselves. | |
72 | ||
73 | The [link spirit.spiritkarmaflow figure] below shows the typical data flow of | |
74 | some input being converted to some internal representation. After some | |
75 | (optional) transformation these data are converted back into some different, | |
76 | external representation. The picture highlights Spirit's place in this data | |
77 | transformation flow. | |
78 | ||
79 | [fig spiritkarmaflow.png..The place of __qi__ and __karma__ in a data transformation flow of a typical application..spirit.spiritkarmaflow] | |
80 | ||
81 | [heading A Quick Overview of Parsing with __qi__] | |
82 | ||
83 | __qi__ is Spirit's sublibrary dealing with generating parsers based on a given | |
84 | target grammar (essentially a format description of the input data to read). | |
85 | ||
86 | A simple EBNF grammar snippet: | |
87 | ||
88 | group ::= '(' expression ')' | |
89 | factor ::= integer | group | |
90 | term ::= factor (('*' factor) | ('/' factor))* | |
91 | expression ::= term (('+' term) | ('-' term))* | |
92 | ||
93 | is approximated using facilities of Spirit's /Qi/ sublibrary as seen in this | |
94 | code snippet: | |
95 | ||
96 | group = '(' >> expression >> ')'; | |
97 | factor = integer | group; | |
98 | term = factor >> *(('*' >> factor) | ('/' >> factor)); | |
99 | expression = term >> *(('+' >> term) | ('-' >> term)); | |
100 | ||
101 | Through the magic of expression templates, this is perfectly valid and | |
102 | executable C++ code. The production rule `expression` is, in fact, an object that | |
103 | has a member function `parse` that does the work given a source code written in | |
104 | the grammar that we have just declared. Yes, it's a calculator. We shall | |
105 | simplify for now by skipping the type declarations and the definition of the | |
106 | rule `integer` invoked by `factor`. Now, the production rule `expression` in our | |
107 | grammar specification, traditionally called the `start` symbol, can recognize | |
108 | inputs such as: | |
109 | ||
110 | 12345 | |
111 | -12345 | |
112 | +12345 | |
113 | 1 + 2 | |
114 | 1 * 2 | |
115 | 1/2 + 3/4 | |
116 | 1 + 2 + 3 + 4 | |
117 | 1 * 2 * 3 * 4 | |
118 | (1 + 2) * (3 + 4) | |
119 | (-1 + 2) * (3 + -4) | |
120 | 1 + ((6 * 200) - 20) / 6 | |
121 | (1 + (2 + (3 + (4 + 5)))) | |
122 | ||
123 | Certainly we have modified the original EBNF syntax. This is done to | |
124 | conform to C++ syntax rules. Most notably we see the abundance of | |
125 | shift >> operators. Since there are no 'empty' operators in C++, it is | |
126 | simply not possible to write something like: | |
127 | ||
128 | a b | |
129 | ||
130 | as seen in math syntax, for example, to mean multiplication or, in our case, | |
131 | as seen in EBNF syntax to mean sequencing (b should follow a). __qi__ | |
132 | uses the shift `>>` operator instead for this purpose. We take the `>>` operator, | |
133 | with arrows pointing to the right, to mean "is followed by". Thus we write: | |
134 | ||
135 | a >> b | |
136 | ||
137 | The alternative operator `|` and the parentheses `()` remain as is. The | |
138 | assignment operator `=` is used in place of EBNF's `::=`. Last but not least, | |
139 | the Kleene star `*`, which in this case is a postfix operator in EBNF becomes a | |
140 | prefix. Instead of: | |
141 | ||
142 | a* //... in EBNF syntax, | |
143 | ||
144 | we write: | |
145 | ||
146 | *a //... in Spirit. | |
147 | ||
148 | since there are no postfix stars, `*`, in C/C++. Finally, we terminate each | |
149 | rule with the ubiquitous semi-colon, `;`. | |
150 | ||
151 | ||
152 | [heading A Quick Overview of Output Generation with __karma__] | |
153 | ||
154 | Spirit not only allows you to describe the structure of the input, it also enables | |
155 | the specification of the output format for your data in a similar way, and based | |
156 | on a single syntax and compatible semantics. | |
157 | ||
158 | Let's assume we need to generate a textual representation from a simple data | |
159 | structure such as a `std::vector<int>`. Conventional code probably would look like: | |
160 | ||
161 | std::vector<int> v (initialize_and_fill()); | |
162 | std::vector<int>::iterator end = v.end(); | |
163 | for (std::vector<int>::iterator it = v.begin(); it != end; ++it) | |
164 | std::cout << *it << std::endl; | |
165 | ||
166 | which is not very flexible and quite difficult to maintain when it comes to | |
167 | changing the required output format. Spirit's sublibrary /Karma/ allows you to | |
168 | specify output formats for arbitrary data structures in a very flexible way. | |
169 | The following snippet is the /Karma/ format description used to create the | |
170 | same output as the traditional code above: | |
171 | ||
172 | *(int_ << eol) | |
173 | ||
174 | Here are some more examples of format descriptions for different output | |
175 | representations of the same `std::vector<int>`: | |
176 | ||
177 | [table Different output formats for `std::vector<int>` | |
178 | [ [Format] [Example] [Description] ] | |
179 | [ [`'[' << *(int_ << ',') << ']'`] [`[1,8,10,]`] [Comma separated list of integers] ] | |
180 | [ [`*('(' << int_ << ')' << ',')`] [`(1),(8),(10),`] [Comma separated list of integers in parenthesis] ] | |
181 | [ [`*hex`] [`18a`] [A list of hexadecimal numbers] ] | |
182 | [ [`*(double_ << ',')`] [`1.0,8.0,10.0,`] [A list of floating point numbers] ] | |
183 | ] | |
184 | ||
185 | We will see later in this documentation how it is possible to avoid printing | |
186 | the trailing `','`. | |
187 | ||
188 | Overall, the syntax is similar to __qi__ with the exception that we use the `<<` | |
189 | operator for output concatenation. This should be easy to understand as it | |
190 | follows the conventions used in the Standard's I/O streams. | |
191 | ||
192 | Another important feature of __karma__ allows you to fully decouple the data | |
193 | type from the output format. You can use the same output format with different | |
194 | data types as long as these conform conceptually. The next table gives some | |
195 | related examples. | |
196 | ||
197 | [table Different data types usable with the output format `*(int_ << eol)` | |
198 | [ [Data type] [Description] ] | |
199 | [ [`int i[4]`] [C style arrays] ] | |
200 | [ [`std::vector<int>`] [Standard vector] ] | |
201 | [ [`std::list<int>`] [Standard list] ] | |
202 | [ [`boost::array<long, 20>`] [Boost array] ] | |
203 | ] | |
204 | ||
205 | [endsect] |