]>
Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | [/ |
2 | / Copyright (c) 2008 Eric Niebler | |
3 | / | |
4 | / Distributed under the Boost Software License, Version 1.0. (See accompanying | |
5 | / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) | |
6 | /] | |
7 | ||
8 | [section Grammars and Nested Matches] | |
9 | ||
10 | [h2 Overview] | |
11 | ||
12 | One of the key benefits of representing regexes as C++ expressions is the ability to easily refer to other C++ | |
13 | code and data from within the regex. This enables programming idioms that are not possible with other regular | |
14 | expression libraries. Of particular note is the ability for one regex to refer to another regex, allowing you | |
15 | to build grammars out of regular expressions. This section describes how to embed one regex in another by value | |
16 | and by reference, how regex objects behave when they refer to other regexes, and how to access the tree of results | |
17 | after a successful parse. | |
18 | ||
19 | [h2 Embedding a Regex by Value] | |
20 | ||
21 | The _basic_regex_ object has value semantics. When a regex object appears on the right-hand side in the definition | |
22 | of another regex, it is as if the regex were embedded by value; that is, a copy of the nested regex is stored by | |
23 | the enclosing regex. The inner regex is invoked by the outer regex during pattern matching. The inner regex | |
24 | participates fully in the match, back-tracking as needed to make the match succeed. | |
25 | ||
26 | Consider a text editor that has a regex-find feature with a whole-word option. You can implement this with | |
27 | xpressive as follows: | |
28 | ||
29 | find_dialog dlg; | |
30 | if( dialog_ok == dlg.do_modal() ) | |
31 | { | |
32 | std::string pattern = dlg.get_text(); // the pattern the user entered | |
33 | bool whole_word = dlg.whole_word.is_checked(); // did the user select the whole-word option? | |
34 | ||
35 | sregex re = sregex::compile( pattern ); // try to compile the pattern | |
36 | ||
37 | if( whole_word ) | |
38 | { | |
39 | // wrap the regex in begin-word / end-word assertions | |
40 | re = bow >> re >> eow; | |
41 | } | |
42 | ||
43 | // ... use re ... | |
44 | } | |
45 | ||
46 | Look closely at this line: | |
47 | ||
48 | // wrap the regex in begin-word / end-word assertions | |
49 | re = bow >> re >> eow; | |
50 | ||
51 | This line creates a new regex that embeds the old regex by value. Then, the new regex is assigned back to | |
52 | the original regex. Since a copy of the old regex was made on the right-hand side, this works as you might | |
53 | expect: the new regex has the behavior of the old regex wrapped in begin- and end-word assertions. | |
54 | ||
55 | [note Note that `re = bow >> re >> eow` does ['not] define a recursive regular expression, since regex | |
56 | objects embed by value by default. The next section shows how to define a recursive regular expression by | |
57 | embedding a regex by reference.] | |
58 | ||
59 | [h2 Embedding a Regex by Reference] | |
60 | ||
61 | If you want to be able to build recursive regular expressions and context-free grammars, embedding a regex | |
62 | by value is not enough. You need to be able to make your regular expressions self-referential. Most regular | |
63 | expression engines don't give you that power, but xpressive does. | |
64 | ||
65 | [tip The theoretical computer scientists out there will correctly point out that a self-referential | |
66 | regular expression is not "regular", so in the strict sense, xpressive isn't really a ['regular] expression engine | |
67 | at all. But as Larry Wall once said, "the term '''[regular expression]''' has grown with the capabilities of our | |
68 | pattern matching engines, so I'm not going to try to fight linguistic necessity here."] | |
69 | ||
70 | Consider the following code, which uses the `by_ref()` helper to define a recursive regular expression that | |
71 | matches balanced, nested parentheses: | |
72 | ||
73 | sregex parentheses; | |
74 | parentheses // A balanced set of parentheses ... | |
75 | = '(' // is an opening parenthesis ... | |
76 | >> // followed by ... | |
77 | *( // zero or more ... | |
78 | keep( +~(set='(',')') ) // of a bunch of things that are not parentheses ... | |
79 | | // or ... | |
80 | by_ref(parentheses) // a balanced set of parentheses | |
81 | ) // (ooh, recursion!) ... | |
82 | >> // followed by ... | |
83 | ')' // a closing parenthesis | |
84 | ; | |
85 | ||
86 | Matching balanced, nested tags is an important text processing task, and it is one that "classic" regular | |
87 | expressions cannot do. The `by_ref()` helper makes it possible. It allows one regex object to be embedded | |
88 | in another ['by reference]. Since the right-hand side holds `parentheses` by reference, assigning the right-hand | |
89 | side back to `parentheses` creates a cycle, which will execute recursively. | |
90 | ||
91 | [h2 Building a Grammar] | |
92 | ||
93 | Once we allow self-reference in our regular expressions, the genie is out of the bottle and all manner of | |
94 | fun things are possible. In particular, we can now build grammars out of regular expressions. Let's have | |
95 | a look at the text-book grammar example: the humble calculator. | |
96 | ||
97 | sregex group, factor, term, expression; | |
98 | ||
99 | group = '(' >> by_ref(expression) >> ')'; | |
100 | factor = +_d | group; | |
101 | term = factor >> *(('*' >> factor) | ('/' >> factor)); | |
102 | expression = term >> *(('+' >> term) | ('-' >> term)); | |
103 | ||
104 | The regex `expression` defined above does something rather remarkable for a regular expression: it matches | |
105 | mathematical expressions. For example, if the input string were `"foo 9*(10+3) bar"`, this pattern would | |
106 | match `"9*(10+3)"`. It only matches well-formed mathematical expressions, where the parentheses are | |
107 | balanced and the infix operators have two arguments each. Don't try this with just any regular expression | |
108 | engine! | |
109 | ||
110 | Let's take a closer look at this regular expression grammar. Notice that it is cyclic: `expression` is | |
111 | implemented in terms of `term`, which is implemented in terms of `factor`, which is implemented in terms | |
112 | of `group`, which is implemented in terms of `expression`, closing the loop. In general, the way to define | |
113 | a cyclic grammar is to forward-declare the regex objects and embed by reference those regular expressions | |
114 | that have not yet been initialized. In the above grammar, there is only one place where we need to reference | |
115 | a regex object that has not yet been initialized: the definition of `group`. In that place, we use | |
116 | `by_ref()` to embed `expression` by reference. In all other places, it is sufficient to embed the other | |
117 | regex objects by value, since they have already been initialized and their values will not change. | |
118 | ||
119 | [tip [*Embed by value if possible] | |
120 | \n\n | |
121 | In general, prefer embedding regular expressions by value rather than by reference. It | |
122 | involves one less indirection, making your patterns match a little faster. Besides, value semantics | |
123 | are simpler and will make your grammars easier to reason about. Don't worry about the expense of "copying" | |
124 | a regex. Each regex object shares its implementation with all of its copies.] | |
125 | ||
126 | [h2 Dynamic Regex Grammars] | |
127 | ||
128 | Using _regex_compiler_, you can also build grammars out of dynamic regular expressions. | |
129 | You do that by creating named regexes, and referring to other regexes by name. Each | |
130 | _regex_compiler_ instance keeps a mapping from names to regexes that have been created | |
131 | with it. | |
132 | ||
133 | You can create a named dynamic regex by prefacing your regex with `"(?$name=)"`, where | |
134 | /name/ is the name of the regex. You can refer to a named regex from another regex with | |
135 | `"(?$name)"`. The named regex does not need to exist yet at the time it is referenced | |
136 | in another regex, but it must exist by the time you use the regex. | |
137 | ||
138 | Below is a code fragment that uses dynamic regex grammars to implement the calculator | |
139 | example from above. | |
140 | ||
141 | using namespace boost::xpressive; | |
142 | using namespace regex_constants; | |
143 | ||
144 | sregex expr; | |
145 | ||
146 | { | |
147 | sregex_compiler compiler; | |
148 | syntax_option_type x = ignore_white_space; | |
149 | ||
150 | compiler.compile("(? $group = ) \\( (? $expr ) \\) ", x); | |
151 | compiler.compile("(? $factor = ) \\d+ | (? $group ) ", x); | |
152 | compiler.compile("(? $term = ) (? $factor )" | |
153 | " ( \\* (? $factor ) | / (? $factor ) )* ", x); | |
154 | expr = compiler.compile("(? $expr = ) (? $term )" | |
155 | " ( \\+ (? $term ) | - (? $term ) )* ", x); | |
156 | } | |
157 | ||
158 | std::string str("foo 9*(10+3) bar"); | |
159 | smatch what; | |
160 | ||
161 | if(regex_search(str, what, expr)) | |
162 | { | |
163 | // This prints "9*(10+3)": | |
164 | std::cout << what[0] << std::endl; | |
165 | } | |
166 | ||
167 | As with static regex grammars, nested regex invocations create nested | |
168 | match results (see /Nested Results/ below). The result is a complete parse tree | |
169 | for string that matched. Unlike static regexes, dynamic regexes are always | |
170 | embedded by reference, not by value. | |
171 | ||
172 | [h2 Cyclic Patterns, Copying and Memory Management, Oh My!] | |
173 | ||
174 | The calculator examples above raises a number of very complicated memory-management issues. Each of the | |
175 | four regex objects refer to each other, some directly and some indirectly, some by value and some by | |
176 | reference. What if we were to return one of them from a function and let the others go out of scope? | |
177 | What becomes of the references? The answer is that the regex objects are internally reference counted, | |
178 | such that they keep their referenced regex objects alive as long as they need them. So passing a regex | |
179 | object by value is never a problem, even if it refers to other regex objects that have gone out of scope. | |
180 | ||
181 | Those of you who have dealt with reference counting are probably familiar with its Achilles Heel: cyclic | |
182 | references. If regex objects are reference counted, what happens to cycles like the one created in the | |
183 | calculator examples? Are they leaked? The answer is no, they are not leaked. The _basic_regex_ object has some tricky | |
184 | reference tracking code that ensures that even cyclic regex grammars are cleaned up when the last external | |
185 | reference goes away. So don't worry about it. Create cyclic grammars, pass your regex objects around and | |
186 | copy them all you want. It is fast and efficient and guaranteed not to leak or result in dangling references. | |
187 | ||
188 | [h2 Nested Regexes and Sub-Match Scoping] | |
189 | ||
190 | Nested regular expressions raise the issue of sub-match scoping. If both the inner and outer regex write | |
191 | to and read from the same sub-match vector, chaos would ensue. The inner regex would stomp on the | |
192 | sub-matches written by the outer regex. For example, what does this do? | |
193 | ||
194 | sregex inner = sregex::compile( "(.)\\1" ); | |
195 | sregex outer = (s1= _) >> inner >> s1; | |
196 | ||
197 | The author probably didn't intend for the inner regex to overwrite the sub-match written by the outer | |
198 | regex. The problem is particularly acute when the inner regex is accepted from the user as input. The | |
199 | author has no way of knowing whether the inner regex will stomp the sub-match vector or not. This is | |
200 | clearly not acceptable. | |
201 | ||
202 | Instead, what actually happens is that each invocation of a nested regex gets its own scope. Sub-matches | |
203 | belong to that scope. That is, each nested regex invocation gets its own copy of the sub-match vector to | |
204 | play with, so there is no way for an inner regex to stomp on the sub-matches of an outer regex. So, for | |
205 | example, the regex `outer` defined above would match `"ABBA"`, as it should. | |
206 | ||
207 | [h2 Nested Results] | |
208 | ||
209 | If nested regexes have their own sub-matches, there should be a way to access them after a successful | |
210 | match. In fact, there is. After a _regex_match_ or _regex_search_, the _match_results_ struct behaves | |
211 | like the head of a tree of nested results. The _match_results_ class provides a `nested_results()` | |
212 | member function that returns an ordered sequence of _match_results_ structures, representing the | |
213 | results of the nested regexes. The order of the nested results is the same as the order in which | |
214 | the nested regex objects matched. | |
215 | ||
216 | Take as an example the regex for balanced, nested parentheses we saw earlier: | |
217 | ||
218 | sregex parentheses; | |
219 | parentheses = '(' >> *( keep( +~(set='(',')') ) | by_ref(parentheses) ) >> ')'; | |
220 | ||
221 | smatch what; | |
222 | std::string str( "blah blah( a(b)c (c(e)f (g)h )i (j)6 )blah" ); | |
223 | ||
224 | if( regex_search( str, what, parentheses ) ) | |
225 | { | |
226 | // display the whole match | |
227 | std::cout << what[0] << '\n'; | |
228 | ||
229 | // display the nested results | |
230 | std::for_each( | |
231 | what.nested_results().begin(), | |
232 | what.nested_results().end(), | |
233 | output_nested_results() ); | |
234 | } | |
235 | ||
236 | This program displays the following: | |
237 | ||
238 | [pre | |
239 | ( a(b)c (c(e)f (g)h )i (j)6 ) | |
240 | (b) | |
241 | (c(e)f (g)h ) | |
242 | (e) | |
243 | (g) | |
244 | (j) | |
245 | ] | |
246 | ||
247 | Here you can see how the results are nested and that they are stored in the order in which they | |
248 | are found. | |
249 | ||
250 | [tip See the definition of [link boost_xpressive.user_s_guide.examples.display_a_tree_of_nested_results output_nested_results] in the | |
251 | [link boost_xpressive.user_s_guide.examples Examples] section.] | |
252 | ||
253 | [h2 Filtering Nested Results] | |
254 | ||
255 | Sometimes a regex will have several nested regex objects, and you want to know which result corresponds | |
256 | to which regex object. That's where `basic_regex<>::regex_id()` and `match_results<>::regex_id()` | |
257 | come in handy. When iterating over the nested results, you can compare the regex id from the results to | |
258 | the id of the regex object you're interested in. | |
259 | ||
260 | To make this a bit easier, xpressive provides a predicate to make it simple to iterate over just the | |
261 | results that correspond to a certain nested regex. It is called `regex_id_filter_predicate`, and it is | |
262 | intended to be used with _iterator_. You can use it as follows: | |
263 | ||
264 | sregex name = +alpha; | |
265 | sregex integer = +_d; | |
266 | sregex re = *( *_s >> ( name | integer ) ); | |
267 | ||
268 | smatch what; | |
269 | std::string str( "marsha 123 jan 456 cindy 789" ); | |
270 | ||
271 | if( regex_match( str, what, re ) ) | |
272 | { | |
273 | smatch::nested_results_type::const_iterator begin = what.nested_results().begin(); | |
274 | smatch::nested_results_type::const_iterator end = what.nested_results().end(); | |
275 | ||
276 | // declare filter predicates to select just the names or the integers | |
277 | sregex_id_filter_predicate name_id( name.regex_id() ); | |
278 | sregex_id_filter_predicate integer_id( integer.regex_id() ); | |
279 | ||
280 | // iterate over only the results from the name regex | |
281 | std::for_each( | |
282 | boost::make_filter_iterator( name_id, begin, end ), | |
283 | boost::make_filter_iterator( name_id, end, end ), | |
284 | output_result | |
285 | ); | |
286 | ||
287 | std::cout << '\n'; | |
288 | ||
289 | // iterate over only the results from the integer regex | |
290 | std::for_each( | |
291 | boost::make_filter_iterator( integer_id, begin, end ), | |
292 | boost::make_filter_iterator( integer_id, end, end ), | |
293 | output_result | |
294 | ); | |
295 | } | |
296 | ||
297 | where `output_results` is a simple function that takes a `smatch` and displays the full match. | |
298 | Notice how we use the `regex_id_filter_predicate` together with `basic_regex<>::regex_id()` and | |
299 | `boost::make_filter_iterator()` from the _iterator_ to select only those results | |
300 | corresponding to a particular nested regex. This program displays the following: | |
301 | ||
302 | [pre | |
303 | marsha | |
304 | jan | |
305 | cindy | |
306 | 123 | |
307 | 456 | |
308 | 789 | |
309 | ] | |
310 | ||
311 | ||
312 | ||
313 | [endsect] |