[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]

[section:lexer_tokenizing Tokenizing Input Data]

[heading The tokenize function]

The `tokenize()` function is a helper function simplifying the usage of a lexer
in a stand-alone fashion. For instance, you may have a stand-alone lexer where all
functional requirements are implemented inside lexer semantic actions.
A good example for this is the [@../../example/lex/word_count_lexer.cpp word_count_lexer]
described in more detail in the section __sec_lex_quickstart_2__.

[wcl_token_definition]

Tokenizing the given input while discarding all generated tokens is a common
application of a lexer. For this reason __lex__ exposes the API function
`tokenize()`, which minimizes the required code:

    // Read input from the given file
    std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));

    word_count_tokens<lexer_type> word_count_lexer;
    std::string::iterator first = str.begin();

    // Tokenize all the input, while discarding all generated tokens
    bool r = tokenize(first, str.end(), word_count_lexer);

This code is completely equivalent to the more verbose version shown in the
section __sec_lex_quickstart_2__. The function `tokenize()` returns either after
the end of the input has been reached (in this case the return value is
`true`), or as soon as the lexer fails to match any of the token definitions in
the input (in this case the return value is `false` and the iterator `first`
points to the first unmatched character in the input sequence).

The prototype of this function is:

    template <typename Iterator, typename Lexer>
    bool tokenize(Iterator& first, Iterator last, Lexer const& lex
      , typename Lexer::char_type const* initial_state = 0);

[variablelist where:
    [[Iterator& first]      [The beginning of the input sequence to tokenize. The
                             value of this iterator will be updated by the
                             lexer, pointing to the first not matched
                             character of the input after the function
                             returns.]]
    [[Iterator last]        [The end of the input sequence to tokenize.]]
    [[Lexer const& lex]     [The lexer instance to use for tokenization.]]
    [[Lexer::char_type const* initial_state]
                            [This optional parameter can be used to specify
                             the initial lexer state for tokenization.]]
]

A second overload of the `tokenize()` function allows specifying an arbitrary
function or function object to be called for each of the generated tokens. For
some applications this is very useful, as it might avoid having lexer semantic
actions at all. For an example of how to use this function, please have a look at
[@../../example/lex/word_count_functor.cpp word_count_functor.cpp]:

[wcf_main]

Here is the prototype of this `tokenize()` function overload:

    template <typename Iterator, typename Lexer, typename F>
    bool tokenize(Iterator& first, Iterator last, Lexer const& lex, F f
      , typename Lexer::char_type const* initial_state = 0);

[variablelist where:
    [[Iterator& first]      [The beginning of the input sequence to tokenize. The
                             value of this iterator will be updated by the
                             lexer, pointing to the first not matched
                             character of the input after the function
                             returns.]]
    [[Iterator last]        [The end of the input sequence to tokenize.]]
    [[Lexer const& lex]     [The lexer instance to use for tokenization.]]
    [[F f]                  [A function or function object to be called for
                             each matched token. This function is expected to
                             have the prototype: `bool f(Lexer::token_type);`.
                             The `tokenize()` function will return immediately if
                             `F` returns `false`.]]
    [[Lexer::char_type const* initial_state]
                            [This optional parameter can be used to specify
                             the initial lexer state for tokenization.]]
]

[/heading The generate_static_dfa function]

[endsect]