[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]

[section:lexer_introduction Introduction to __lex__]

Lexical scanning is the process of analyzing the stream of input characters and
separating it into strings called tokens, most often delimited by whitespace.
Most compiler texts start here, and devote several chapters to discussing
various ways to build scanners. __lex__ is a library built to take care of the
complexities of creating a lexer for your grammar (in this documentation we
will use the terms 'lexical analyzer', 'lexer' and 'scanner' interchangeably).
All that is needed to create a lexer is to know the set of patterns describing
the different tokens you want to recognize in the input. To make this a bit
more formal, here are some definitions:

* A token is a sequence of consecutive characters having a collective meaning.
  Tokens may have attributes specific to the token type, carrying additional
  information about the matched character sequence.
* A pattern is a rule expressed as a regular expression and describing how a
  particular token can be formed. For example, [^\[A-Za-z_\]\[A-Za-z_0-9\]*] is
  a pattern for a rule matching C++ identifiers (a sketch of a corresponding
  token definition follows this list).
* Characters between tokens are called whitespace; these include spaces, tabs,
  newlines, and formfeeds. Many people also count comments as whitespace,
  though since some tools such as lint look at comments, this method is not
  perfect.

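To make the pattern definition above concrete, here is a minimal sketch of how
it could be written as a __lex__ token definition. The class and member names
are chosen for illustration only:

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <string>

    namespace lex = boost::spirit::lex;

    // a minimal lexer holding a single token definition built from the
    // identifier pattern given above (all names are illustrative)
    template <typename Lexer>
    struct identifier_tokens : lex::lexer<Lexer>
    {
        identifier_tokens()
          : identifier("[A-Za-z_][A-Za-z_0-9]*")   // the pattern for the token
        {
            // associate the token definition with the lexer
            this->self = identifier;
        }

        // the token exposes the matched characters as its attribute
        lex::token_def<std::string> identifier;
    };
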
[heading Why Use a Separate Lexer?]

Typically, lexical scanning is done in a separate module from the parser,
feeding the parser with a stream of input tokens only. Theoretically it is not
necessary to implement this separation, as in the end there is only one set of
syntactical rules defining the language, so in theory we could write the whole
parser in one module. In fact, __qi__ allows you to write parsers without using
a lexer, parsing the input character stream directly, and for the most part
this is the way __spirit__ has been used since its invention.

However, this separation has both a theoretical and a practical basis, and
proves to be very useful in practical applications. In 1956, Noam Chomsky
defined the "Chomsky Hierarchy" of grammars:

* Type 0: Unrestricted grammars (e.g., natural languages)
* Type 1: Context-Sensitive grammars
* Type 2: Context-Free grammars
* Type 3: Regular grammars

The complexity of these grammars increases from regular grammars being the
simplest to unrestricted grammars being the most complex. Similarly, the
complexity of pattern recognition for these grammars increases. Although a few
features of some programming languages (such as C++) are Type 1, fortunately
for the most part programming languages can be described using only Types 2
and 3. The neat part about these two types is that they are well known and the
ways to parse them are well understood. It has been shown that any regular
grammar can be parsed using a state machine (finite automaton). Similarly,
context-free grammars can always be parsed using a push-down automaton
(essentially a state machine augmented by a stack).

In real programming languages and practical grammars, the parts that can be
handled as regular expressions tend to be the lower-level pieces, such as the
definition of an identifier or of an integer value:

    letter     := [a-zA-Z]
    digit      := [0-9]

    identifier := letter [ letter | digit ]*
    integer    := digit+

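Expressed using __lex__, the [^letter] and [^digit] rules map naturally onto
substitution patterns (pattern macros) that the token definitions can refer
to. The following is a rough sketch with illustrative names, not a verbatim
excerpt from any example:

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <string>

    namespace lex = boost::spirit::lex;

    template <typename Lexer>
    struct example_tokens : lex::lexer<Lexer>
    {
        example_tokens()
        {
            // 'letter' and 'digit' become substitution patterns ...
            this->self.add_pattern("LETTER", "[a-zA-Z]");
            this->self.add_pattern("DIGIT", "[0-9]");

            // ... which the token definitions refer to as {LETTER} and {DIGIT}
            identifier = "{LETTER}({LETTER}|{DIGIT})*";
            integer    = "{DIGIT}+";

            // associate the token definitions with the lexer
            this->self = identifier | integer;
        }

        // note: the token type used at instantiation has to list these
        // attribute types
        lex::token_def<std::string> identifier;
        lex::token_def<unsigned int> integer;
    };
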
Higher-level parts of practical grammars tend to be more complex and can't be
implemented using plain regular expressions. Instead, we need to store
additional information on a stack while recursing through the grammar
hierarchy (for top-down parsers this is typically the run-time call stack),
and that is the preferred approach used for top-down parsing. Since it takes a
different kind of abstract machine to parse the two types of grammars, it
proved to be efficient to separate the lexical scanner into a separate module
which is built around the idea of a state machine. The goal here is to use the
simplest parsing technique needed for the job.

Another, more practical, reason for separating the scanner from the parser is
the need for backtracking during parsing. The input data is a stream of
characters, which is often thought to be processed left to right without any
backtracking. Unfortunately, in practice most of the time that isn't possible.
Almost every language has certain keywords such as IF, FOR, and WHILE. The
decision whether a certain character sequence actually comprises a keyword or
just an identifier often can be made only after seeing the first delimiter
/after/ it. This effectively requires backtracking, since we need to store the
string long enough to be able to make the decision. The same is true for more
coarse-grained language features such as nested IF/ELSE statements, where the
decision as to which IF the last ELSE statement belongs can be made only after
seeing the whole construct.

So the structure of a conventional compiler often involves splitting up the
functions of the lower-level and higher-level parsing. The lexical scanner
deals with things at the character level, collecting characters into strings,
converting character sequences into different representations such as
integers, etc., and passing them along to the parser proper as indivisible
tokens. It's also considered normal to let the scanner do additional jobs,
such as identifying keywords, storing identifiers in tables, etc.

Now, __spirit__ follows this structure, where __lex__ can be used to implement
state machine based recognizers, while __qi__ can be used to build recognizers
for context-free grammars. Since both modules are seamlessly integrated with
each other and with the C++ target language it is even possible to use the
provided functionality to build more complex grammar recognizers.

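As a small illustration of this integration, the following sketch feeds the
token sequence produced by a __lex__ lexer directly into a __qi__ grammar
using [^lex::tokenize_and_parse]. The [^toy_tokens] and [^toy_grammar] names
are made up for this sketch:

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/spirit/include/qi.hpp>
    #include <string>

    namespace lex = boost::spirit::lex;
    namespace qi  = boost::spirit::qi;

    // a small lexer: identifiers, integers, and whitespace
    template <typename Lexer>
    struct toy_tokens : lex::lexer<Lexer>
    {
        toy_tokens()
          : identifier("[a-zA-Z_][a-zA-Z0-9_]*")
          , integer("[0-9]+")
          , ws("[ \\t\\n]+")
        {
            this->self = identifier | integer | ws;
        }
        lex::token_def<> identifier, integer, ws;
    };

    // a grammar consuming the token sequence produced by the lexer
    template <typename Iterator>
    struct toy_grammar : qi::grammar<Iterator>
    {
        template <typename TokenDef>
        toy_grammar(TokenDef const& tok)
          : toy_grammar::base_type(start)
        {
            // any sequence of identifiers and integers, whitespace allowed
            start = *(tok.identifier | tok.integer | tok.ws);
        }
        qi::rule<Iterator> start;
    };

    int main()
    {
        // token and lexer types for the dynamic model
        typedef lex::lexertl::token<char const*> token_type;
        typedef lex::lexertl::lexer<token_type> lexer_type;

        toy_tokens<lexer_type> lexer;
        toy_grammar<toy_tokens<lexer_type>::iterator_type> parser(lexer);

        std::string input("value1 42 value2");
        char const* first = input.c_str();
        char const* last = first + input.size();

        // tokenize the character stream and parse the resulting tokens
        bool ok = lex::tokenize_and_parse(first, last, lexer, parser);
        return ok ? 0 : 1;
    }
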
[heading Advantages of using __lex__]

The advantage of using __lex__ to create the lexical analyzer over using more
traditional tools such as __flex__ is its carefully crafted integration with
the __spirit__ library and the C++ host language. You don't need any external
tools to generate the code, and your lexer will be perfectly integrated with
the rest of your program, making it possible to freely access any context
information and data structure. Since the C++ compiler sees all the code, it
will generate optimal code no matter what configuration options have been
chosen by the user. __lex__ gives you the vast majority of features you could
get from a similar __flex__ program without the need to leave C++ as a host
language:

* The definition of tokens is done using regular expressions (patterns)
* The token definitions can refer to special substitution strings (pattern
  macros) simplifying pattern definitions
* The generated lexical scanner may have multiple start states
* It is possible to attach code to any of the token definitions; this code gets
  executed whenever the corresponding token pattern has been matched (a small
  sketch of this follows the list)

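To illustrate the last point, code can be attached to a token definition using
Phoenix. The following is a minimal sketch; the counter names are illustrative
and not part of any fixed API:

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/spirit/include/phoenix_operator.hpp>
    #include <cstddef>

    namespace lex = boost::spirit::lex;

    // count words and lines by attaching code to the token definitions;
    // the attached code runs whenever the corresponding pattern matches
    template <typename Lexer>
    struct counting_tokens : lex::lexer<Lexer>
    {
        counting_tokens()
          : words(0), lines(0)
          , word("[^ \\t\\n]+"), eol("\n"), any(".")
        {
            using boost::phoenix::ref;

            this->self
                =   word [++ref(words)]    // executed on each word match
                |   eol  [++ref(lines)]    // executed on each newline
                |   any
                ;
        }

        std::size_t words, lines;
        lex::token_def<> word, eol, any;
    };

Note that a lexer type supporting semantic actions (such as
[^lex::lexertl::actor_lexer]) has to be used when instantiating such a
definition.
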
Even though it is possible to use __lex__ to generate C++ code representing
the lexical analyzer (we will refer to that as the /static/ model, described
in more detail in the section __sec_lex_static_model__) - a model very similar
to the way __flex__ operates - we will mainly focus on the opposite, the
/dynamic/ model. You can directly integrate the token definitions into your
C++ program, building the lexical analyzer dynamically at runtime. The dynamic
model is something not supported by __flex__ or other lexical scanner
generators (such as __re2c__, __ragel__, etc.). This dynamic flexibility
allows you to speed up the development of your application.

[heading The Library Structure of __lex__]

The [link spirit.lexerflow figure] below shows a high level overview of how
the __lex__ library might be used in an application. __lex__ allows you to
create lexical analyzers based on patterns. These patterns are regular
expression based rules used to define the different tokens to be recognized
in the character input sequence. The input sequence is expected to be provided
to the lexical analyzer as an arbitrary standard forward iterator. The lexical
analyzer itself exposes a standard forward iterator as well. The difference
here is that the exposed iterator provides access to the token sequence
instead of to the character sequence. The tokens in this sequence are
constructed on the fly by analyzing the underlying character sequence and
matching this to the patterns as defined by the application.

[fig lexerflow.png..The Library structure and Common Flow of Information while using __lex__ in an application..spirit.lexerflow]
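
The following is a minimal sketch of this iterator-based interface, reusing
the hypothetical [^toy_tokens] lexer from the earlier sketch; [^token_is_valid]
reports whether the lexer was able to match another token:

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <iostream>
    #include <string>

    namespace lex = boost::spirit::lex;

    int main()
    {
        typedef lex::lexertl::token<char const*> token_type;
        typedef lex::lexertl::lexer<token_type> lexer_type;

        toy_tokens<lexer_type> lexer;          // the lexer sketched earlier

        std::string input("value1 42 value2");
        char const* first = input.c_str();
        char const* last = first + input.size();

        // the lexer exposes a forward iterator over the token sequence,
        // with tokens constructed on the fly from the character sequence
        toy_tokens<lexer_type>::iterator_type it = lexer.begin(first, last);
        toy_tokens<lexer_type>::iterator_type end = lexer.end();

        while (it != end && token_is_valid(*it))
        {
            std::cout << "token id: " << it->id() << std::endl;
            ++it;
        }
        return 0;
    }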

[endsect]