[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]

[section:lexer_token_values About Tokens and Token Values]

As already discussed, lexical scanning is the process of analyzing the stream
of input characters and separating it into strings called tokens, most of the
time delimited by whitespace. The different token types recognized by a lexical
analyzer are often assigned unique integer token identifiers (token ids). These
token ids are normally used by the parser to identify the current token without
having to look at the matched string again. The __lex__ library is no
different in this respect, as it uses token ids as the main means of
identifying the different token types defined for a particular lexical
analyzer. However, it differs from commonly used lexical analyzers in that
it returns (references to) instances of a (user defined) token class
to the user. The only requirement placed on this token class is that it must
carry at least the token id of the token it represents. For more information
about the interface a user defined token type has to expose, please see the
__sec_ref_lex_token__ reference. The library provides a default
token type based on the __lexertl__ library which should be sufficient in most
cases: the __class_lexertl_token__ type. This section focuses on the
general features a token class may implement and on how these features
integrate with the other parts of the __lex__ library.

[heading The Anatomy of a Token]

It is very important to understand the difference between a token definition
(represented by the __class_token_def__ template) and a token itself (for
instance represented by the __class_lexertl_token__ template).

The token definition is used to describe the main features of a particular
token type, especially:

* to simplify the definition of a token type by means of a regular expression
  pattern applied while matching this token type,
* to associate a token type with a particular lexer state,
* to optionally assign a token id to a token type,
* to optionally associate some code to execute whenever an instance of this
  token type has been matched,
* and to optionally specify the attribute type of the token value.

The token itself is a data structure returned by the lexer iterators.
Dereferencing a lexer iterator returns a reference to the most recently matched
token instance. It encapsulates the part of the underlying input sequence
matched by the regular expression used in the definition of this token type.
Incrementing the lexer iterator invokes the lexical analyzer to
match the next token by advancing through the underlying input stream. The
token data structure contains at least the token id of the matched token type,
allowing the type of the matched character sequence to be identified.
Optionally, the token instance may contain a token value and/or the lexer
state this token instance was matched in. The following
[link spirit.lex.tokenstructure figure] shows the schematic structure of a
token.

[fig tokenstructure.png..The structure of a token..spirit.lex.tokenstructure]

The token value and the lexer state the token has been recognized in may be
omitted for optimization reasons, thus avoiding the need for the token to carry
more data than actually required. This configuration can be achieved by
supplying appropriate template parameters to the
__class_lexertl_token__ template while defining the token type.

The lexer iterator returns the same token type for each of the different
matched token definitions. To accommodate the possibly different token
/value/ types exposed by the various token types (token definitions), the
general type of the token value is a __boost_variant__. At a minimum (for the
default configuration) this token value variant will be configured to always
hold a __boost_iterator_range__ containing the pair of iterators pointing to
the matched input sequence for this token instance.

[note If the lexical analyzer is used in conjunction with a __qi__ parser, the
      stored __boost_iterator_range__ token value will be converted to the
      requested token type (parser attribute) exactly once. This happens at the
      time of the first access to the token value requiring the
      corresponding type conversion. The converted token value will be stored
      in the __boost_variant__ replacing the initially stored iterator range.
      This avoids having to convert the input sequence to the token value more
      than once, thus optimizing the integration of the lexer with __qi__, even
      during parser backtracking.
]

Here is the template prototype of the __class_lexertl_token__ template:

    template <
        typename Iterator = char const*,
        typename AttributeTypes = mpl::vector0<>,
        typename HasState = mpl::true_
    >
    struct lexertl_token;

[variablelist where:
    [[Iterator]       [This is the type of the iterator used to access the
                       underlying input stream. It defaults to a plain
                       `char const*`.]]
    [[AttributeTypes] [This is either an mpl sequence containing all
                       attribute types used for the token definitions or the
                       type `omit`. If the mpl sequence is empty (which is
                       the default), all token instances will store a
                       __boost_iterator_range__`<Iterator>` pointing to the start
                       and the end of the matched section in the input stream.
                       If the type is `omit`, the generated tokens will
                       contain no token value (attribute) at all.]]
    [[HasState]       [This is either `mpl::true_` or `mpl::false_`, allowing
                       control as to whether the generated token instances will
                       contain the lexer state they were generated in. The
                       default is `mpl::true_`, so all token instances will
                       contain the lexer state.]]
]

Normally, during construction, a token instance always holds the
__boost_iterator_range__ as its token value, unless it has been defined
using the `omit` token value type. This iterator range is then
converted in place to the requested token value type (attribute) when it is
accessed for the first time.


[heading The Physiognomy of a Token Definition]

The token definitions (represented by the __class_token_def__ template) are
normally used as part of the definition of the lexical analyzer. At the same
time a token definition instance may be used as a parser component in __qi__.

The template prototype of this class is shown here:

    template <
        typename Attribute = unused_type,
        typename Char = char
    >
    class token_def;

[variablelist where:
    [[Attribute]      [This is the type of the token value (attribute)
                       supported by token instances representing this token
                       type. This attribute type is exposed to the __qi__
                       library whenever this token definition is used as a
                       parser component. The default attribute type is
                       `unused_type`, which means the token instance holds a
                       __boost_iterator_range__ pointing to the start
                       and the end of the matched section in the input stream.
                       If the attribute is `omit`, the token instance will
                       expose no token value at all. Any other type will be
                       used directly as the token value type.]]
    [[Char]           [This is the value type of the iterator for the
                       underlying input sequence. It defaults to `char`.]]
]

The semantics of the template parameters for the token type and the token
definition type are very similar and interdependent. As a rule of thumb, you
can think of the token definition type as the means of specifying everything
related to a single specific token type (such as `identifier` or `integer`).
The token type, on the other hand, is used to define the general properties of
all token instances generated by the __lex__ library.

[important If you don't list any token value types in the declaration of the
           token type (resulting in the use of the default __boost_iterator_range__
           token value), everything will compile and work just fine, just a bit
           less efficiently. This is because the token value will be converted
           from the matched input sequence every time it is requested.

           But as soon as you specify at least one token value type while
           defining the token type, you'll have to list all of the value types
           used in __class_token_def__ declarations there as well, otherwise
           compilation errors will occur.
]


[heading Examples of using __class_lexertl_token__]

Let's start with some examples. We refer to one of the __lex__ examples (for
the full source code of this example please see
[@../../example/lex/example4.cpp example4.cpp]).

[import ../example/lex/example4.cpp]

The first code snippet shows an excerpt of the token definition class: the
definition of a couple of token types. Some of the token types do not expose a
special token value (`if_`, `else_`, and `while_`); their token value will
always hold the iterator range of the matched input sequence. The token
definitions for the `identifier` and the integer `constant` are specialized
to expose an explicit token value type each: `std::string` and `unsigned int`.

[example4_token_def]

As the parsers generated by __qi__ are fully attributed, every __qi__ parser
component needs to expose a certain type as its parser attribute. Naturally,
__class_token_def__ exposes the token value type as its parser attribute,
enabling smooth integration with __qi__.

The next code snippet demonstrates how the required token value types are
specified while defining the token type to use. All of the token value types
used for at least one of the token definitions have to be listed while
defining the token type as well.

[example4_token]

To prevent a token from having a token value at all, the special tag `omit`
can be used: `token_def<omit>` and `lexertl_token<base_iterator_type, omit>`.

[endsect]