[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]

[section:lexer_token_values About Tokens and Token Values]

As already discussed, lexical scanning is the process of analyzing the stream
of input characters and separating it into strings called tokens, most of the
time delimited by whitespace. The different token types recognized by a lexical
analyzer are often assigned unique integer token identifiers (token ids). These
token ids are normally used by the parser to identify the current token without
having to look at the matched string again. The __lex__ library is no
different in this respect, as it uses token ids as the main means of
identifying the different token types defined for a particular lexical
analyzer. However, it differs from commonly used lexical analyzers in that
it returns (references to) instances of a (user defined) token class
to the user. The only requirement placed on this token class is that it must
carry at least the token id of the token it represents. For more information
about the interface a user defined token type has to expose, please see the
__sec_ref_lex_token__ reference. The library provides a default
token type based on the __lexertl__ library which should be sufficient in most
cases: the __class_lexertl_token__ type. This section focuses on the
general features a token class may implement and on how these features
integrate with the other parts of the __lex__ library.

[heading The Anatomy of a Token]

It is very important to understand the difference between a token definition
(represented by the __class_token_def__ template) and a token itself (for
instance represented by the __class_lexertl_token__ template).

The token definition is used to describe the main features of a particular
token type, especially:

* to simplify the definition of a token type by means of a regular expression
  pattern applied while matching this token type,
* to associate a token type with a particular lexer state,
* to optionally assign a token id to a token type,
* to optionally associate some code to execute whenever an instance of this
  token type has been matched,
* and to optionally specify the attribute type of the token value.

The token itself is a data structure returned by the lexer iterators.
Dereferencing a lexer iterator returns a reference to the most recently matched
token instance. It encapsulates the part of the underlying input sequence
matched by the regular expression used in the definition of this token type.
Incrementing the lexer iterator invokes the lexical analyzer to
match the next token by advancing through the underlying input stream. The
token data structure contains at least the token id of the matched token type,
allowing the type of the matched character sequence to be identified.
Optionally, the token instance may contain a token value and/or the lexer
state this token instance was matched in. The following
[link spirit.lex.tokenstructure figure] shows the schematic structure of a
token.

[fig tokenstructure.png..The structure of a token..spirit.lex.tokenstructure]

The token value and the lexer state the token has been recognized in may be
omitted for optimization reasons, thus avoiding the need for the token to carry
more data than actually required. This configuration can be achieved by
supplying appropriate template parameters to the
__class_lexertl_token__ template while defining the token type.

The lexer iterator returns the same token type for each of the different
matched token definitions. To accommodate the possibly different token
/value/ types exposed by the various token types (token definitions), the
general type of the token value is a __boost_variant__. At a minimum (for the
default configuration) this token value variant will be configured to always
hold a __boost_iterator_range__ containing the pair of iterators pointing to
the matched input sequence for this token instance.

[note If the lexical analyzer is used in conjunction with a __qi__ parser, the
      stored __boost_iterator_range__ token value will be converted to the
      requested token type (parser attribute) exactly once. This happens at the
      time of the first access to the token value requiring the
      corresponding type conversion. The converted token value will be stored
      in the __boost_variant__ replacing the initially stored iterator range.
      This avoids having to convert the input sequence to the token value more
      than once, thus optimizing the integration of the lexer with __qi__, even
      during parser backtracking.
]

Here is the template prototype of the __class_lexertl_token__ template:

    template <
        typename Iterator = char const*,
        typename AttributeTypes = mpl::vector0<>,
        typename HasState = mpl::true_
    >
    struct lexertl_token;

[variablelist where:
    [[Iterator]       [This is the type of the iterator used to access the
                       underlying input stream. It defaults to a plain
                       `char const*`.]]
    [[AttributeTypes] [This is either an mpl sequence containing all
                       attribute types used for the token definitions or the
                       type `omit`. If the mpl sequence is empty (which is
                       the default), all token instances will store a
                       __boost_iterator_range__`<Iterator>` pointing to the start
                       and the end of the matched section in the input stream.
                       If the type is `omit`, the generated tokens will
                       contain no token value (attribute) at all.]]
    [[HasState]       [This is either `mpl::true_` or `mpl::false_`, allowing
                       control as to whether the generated token instances will
                       contain the lexer state they were generated in. The
                       default is `mpl::true_`, so all token instances will
                       contain the lexer state.]]
]

Normally, during construction, a token instance always holds the
__boost_iterator_range__ as its token value, unless it has been defined
using the `omit` token value type. This iterator range is then
converted in place to the requested token value type (attribute) when it is
accessed for the first time.


[heading The Physiognomy of a Token Definition]

The token definitions (represented by the __class_token_def__ template) are
normally used as part of the definition of the lexical analyzer. At the same
time a token definition instance may be used as a parser component in __qi__.

The template prototype of this class is shown here:

    template <
        typename Attribute = unused_type,
        typename Char = char
    >
    class token_def;

[variablelist where:
    [[Attribute]      [This is the type of the token value (attribute)
                       supported by token instances representing this token
                       type. This attribute type is exposed to the __qi__
                       library whenever this token definition is used as a
                       parser component. The default attribute type is
                       `unused_type`, which means the token instance holds a
                       __boost_iterator_range__ pointing to the start
                       and the end of the matched section in the input stream.
                       If the attribute is `omit`, the token instance will
                       expose no token value at all. Any other type will be
                       used directly as the token value type.]]
    [[Char]           [This is the value type of the iterator for the
                       underlying input sequence. It defaults to `char`.]]
]

The semantics of the template parameters for the token type and the token
definition type are very similar and interdependent. As a rule of thumb, you
can think of the token definition type as the means of specifying everything
related to a single specific token type (such as `identifier` or `integer`).
The token type, on the other hand, is used to define the general properties of
all token instances generated by the __lex__ library.

[important If you don't list any token value types in the declaration of the
           token type (resulting in the use of the default __boost_iterator_range__
           token value), everything will compile and work just fine, just a bit
           less efficiently. This is because the token value will be converted
           from the matched input sequence every time it is requested.

           But as soon as you specify at least one token value type while
           defining the token type, you'll have to list all of the value types
           used in __class_token_def__ declarations there as well, otherwise
           compilation errors will occur.
]


[heading Examples of using __class_lexertl_token__]

Let's start with some examples. We refer to one of the __lex__ examples (for
the full source code of this example please see
[@../../example/lex/example4.cpp example4.cpp]).

[import ../example/lex/example4.cpp]

The first code snippet shows an excerpt of the token definition class: the
definition of a couple of token types. Some of the token types do not expose a
special token value (`if_`, `else_`, and `while_`); their token value will
always hold the iterator range of the matched input sequence. The token
definitions for the `identifier` and the integer `constant` are specialized
to expose an explicit token value type each: `std::string` and `unsigned int`.

[example4_token_def]

As the parsers generated by __qi__ are fully attributed, every __qi__ parser
component needs to expose a certain type as its parser attribute. Naturally,
__class_token_def__ exposes the token value type as its parser attribute,
enabling smooth integration with __qi__.

The next code snippet demonstrates how the required token value types are
specified while defining the token type to use. All of the token value types
used for at least one of the token definitions have to be listed while
defining the token type as well.

[example4_token]

To prevent a token from having a token value at all, the special tag `omit`
can be used: `token_def<omit>` and `lexertl_token<base_iterator_type, omit>`.

[endsect]