[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]

[section:lexer_tokenizing Tokenizing Input Data]

[heading The tokenize function]

The `tokenize()` function is a helper function simplifying the usage of a lexer
in a stand-alone fashion. For instance, you may have a stand-alone lexer where all
functional requirements are implemented inside lexer semantic actions.
A good example for this is the [@../../example/lex/word_count_lexer.cpp word_count_lexer]
described in more detail in the section __sec_lex_quickstart_2__.

[wcl_token_definition]

Tokenizing the given input while discarding all generated tokens is a common
application of a lexer. For this reason __lex__ exposes the API function
`tokenize()`, which minimizes the required code:

    // Read input from the given file
    std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));

    word_count_tokens<lexer_type> word_count_lexer;
    std::string::iterator first = str.begin();

    // Tokenize all the input, while discarding all generated tokens
    bool r = tokenize(first, str.end(), word_count_lexer);

This code is completely equivalent to the more verbose version shown in the
section __sec_lex_quickstart_2__. The function `tokenize()` returns either after
the end of the input has been reached (in this case the return value is
`true`), or as soon as the lexer fails to match any of the token definitions in
the input (in this case the return value is `false` and the iterator `first`
points to the first unmatched character in the input sequence).

The prototype of this function is:

    template <typename Iterator, typename Lexer>
    bool tokenize(Iterator& first, Iterator last, Lexer const& lex
      , typename Lexer::char_type const* initial_state = 0);

[variablelist where:
    [[Iterator& first]      [The beginning of the input sequence to tokenize. The
                             value of this iterator will be updated by the
                             lexer, pointing to the first not matched
                             character of the input after the function
                             returns.]]
    [[Iterator last]        [The end of the input sequence to tokenize.]]
    [[Lexer const& lex]     [The lexer instance to use for tokenization.]]
    [[Lexer::char_type const* initial_state]
                            [This optional parameter can be used to specify
                             the initial lexer state for tokenization.]]
]

A second overload of the `tokenize()` function allows specifying an arbitrary
function or function object to be called for each of the generated tokens. For
some applications this is very useful, as it might avoid having lexer semantic
actions at all. For an example of how to use this function, please have a look at
[@../../example/lex/word_count_functor.cpp word_count_functor.cpp]:

[wcf_main]

Here is the prototype of this `tokenize()` function overload:

    template <typename Iterator, typename Lexer, typename F>
    bool tokenize(Iterator& first, Iterator last, Lexer const& lex, F f
      , typename Lexer::char_type const* initial_state = 0);

[variablelist where:
    [[Iterator& first]      [The beginning of the input sequence to tokenize. The
                             value of this iterator will be updated by the
                             lexer, pointing to the first not matched
                             character of the input after the function
                             returns.]]
    [[Iterator last]        [The end of the input sequence to tokenize.]]
    [[Lexer const& lex]     [The lexer instance to use for tokenization.]]
    [[F f]                  [A function or function object to be called for
                             each matched token. This function is expected to
                             have the prototype: `bool f(Lexer::token_type);`.
                             The `tokenize()` function will return immediately if
                             `F` returns `false`.]]
    [[Lexer::char_type const* initial_state]
                            [This optional parameter can be used to specify
                             the initial lexer state for tokenization.]]
]

[/heading The generate_static_dfa function]

[endsect]