Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | [/ |
2 | / Copyright (c) 2008 Eric Niebler | |
3 | / | |
4 | / Distributed under the Boost Software License, Version 1.0. (See accompanying | |
5 | / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) | |
6 | /] | |
7 | ||
8 | [section Semantic Actions and User-Defined Assertions] | |
9 | ||
10 | [h2 Overview] | |
11 | ||
12 | Imagine you want to parse an input string and build a `std::map<>` from it. For | |
13 | something like that, matching a regular expression isn't enough. You want to | |
14 | /do something/ when parts of your regular expression match. Xpressive lets | |
15 | you attach semantic actions to parts of your static regular expressions. This | |
16 | section shows you how. | |
17 | ||
18 | [h2 Semantic Actions] | |
19 | ||
20 | Consider the following code, which uses xpressive's semantic actions to parse | |
21 | a string of word/integer pairs and stuffs them into a `std::map<>`. It is | |
22 | described below. | |
23 | ||
24 | #include <string> | |
25 | #include <iostream> | |
26 | #include <boost/xpressive/xpressive.hpp> | |
27 | #include <boost/xpressive/regex_actions.hpp> | |
28 | using namespace boost::xpressive; | |
29 | ||
30 | int main() | |
31 | { | |
32 | std::map<std::string, int> result; | |
33 | std::string str("aaa=>1 bbb=>23 ccc=>456"); | |
34 | ||
35 | // Match a word and an integer, separated by =>, | |
36 | // and then stuff the result into a std::map<> | |
37 | sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) ) | |
38 | [ ref(result)[s1] = as<int>(s2) ]; | |
39 | ||
40 | // Match one or more word/integer pairs, separated | |
41 | // by whitespace. | |
42 | sregex rx = pair >> *(+_s >> pair); | |
43 | ||
44 | if(regex_match(str, rx)) | |
45 | { | |
46 | std::cout << result["aaa"] << '\n'; | |
47 | std::cout << result["bbb"] << '\n'; | |
48 | std::cout << result["ccc"] << '\n'; | |
49 | } | |
50 | ||
51 | return 0; | |
52 | } | |
53 | ||
54 | This program prints the following: | |
55 | ||
56 | [pre | |
57 | 1 | |
58 | 23 | |
59 | 456 | |
60 | ] | |
61 | ||
62 | The regular expression `pair` has two parts: the pattern and the action. The | |
63 | pattern says to match a word, capturing it in sub-match 1, and an integer, | |
64 | capturing it in sub-match 2, separated by `"=>"`. The action is the part in | |
65 | square brackets: `[ ref(result)[s1] = as<int>(s2) ]`. It says to take sub-match | |
66 | one and use it to index into the `result` map, and assign to it the result of | |
67 | converting sub-match 2 to an integer. | |
68 | ||
69 | [note To use semantic actions with your static regexes, you must | |
70 | `#include <boost/xpressive/regex_actions.hpp>`] | |
71 | ||
72 | How does this work? Just like the rest of the static regular expression, the part | |
73 | between brackets is an expression template. It encodes the action and executes | |
74 | it later. The expression `ref(result)` creates a lazy reference to the `result` | |
75 | object. The larger expression `ref(result)[s1]` is a lazy map index operation. | |
76 | Later, when this action is getting executed, `s1` gets replaced with the | |
77 | first _sub_match_. Likewise, when `as<int>(s2)` gets executed, `s2` is replaced | |
78 | with the second _sub_match_. The `as<>` action converts its argument to the | |
79 | requested type using Boost.Lexical_cast. The effect of the whole action is to | |
80 | insert a new word/integer pair into the map. | |
81 | ||
82 | [note There is an important difference between the function `boost::ref()` in | |
83 | `<boost/ref.hpp>` and `boost::xpressive::ref()` in | |
84 | `<boost/xpressive/regex_actions.hpp>`. The first returns a plain | |
85 | `reference_wrapper<>` which behaves in many respects like an ordinary | |
86 | reference. By contrast, `boost::xpressive::ref()` returns a /lazy/ reference | |
87 | that you can use in expressions that are executed lazily. That is why we can | |
88 | say `ref(result)[s1]`, even though `result` doesn't have an `operator[]` that | |
89 | would accept `s1`.] | |
90 | ||
91 | In addition to the sub-match placeholders `s1`, `s2`, etc., you can also use | |
92 | the placeholder `_` within an action to refer back to the string matched by | |
93 | the sub-expression to which the action is attached. For instance, you can use | |
94 | the following regex to match a bunch of digits, interpret them as an integer | |
95 | and assign the result to a local variable: | |
96 | ||
97 | int i = 0; | |
98 | // Here, _ refers back to all the | |
99 | // characters matched by (+_d) | |
100 | sregex rex = (+_d)[ ref(i) = as<int>(_) ]; | |
101 | ||
102 | [h3 Lazy Action Execution] | |
103 | ||
104 | What does it mean, exactly, to attach an action to part of a regular expression | |
105 | and perform a match? When does the action execute? If the action is part of a | |
106 | repeated sub-expression, does the action execute once or many times? And if the | |
107 | sub-expression initially matches, but ultimately fails because the rest of the | |
108 | regular expression fails to match, is the action executed at all? | |
109 | ||
110 | The answer is that by default, actions are executed /lazily/. When a sub-expression | |
111 | matches a string, its action is placed on a queue, along with the current | |
112 | values of any sub-matches to which the action refers. If the match algorithm | |
113 | must backtrack, actions are popped off the queue as necessary. Only after the | |
114 | entire regex has matched successfully are the actions actually executed. They | |
115 | are executed all at once, in the order in which they were added to the queue, | |
116 | as the last step before _regex_match_ returns. | |
117 | ||
118 | For example, consider the following regex that increments a counter whenever | |
119 | it finds a digit. | |
120 | ||
121 | int i = 0; | |
122 | std::string str("1!2!3?"); | |
123 | // count the exciting digits, but not the | |
124 | // questionable ones. | |
125 | sregex rex = +( _d [ ++ref(i) ] >> '!' ); | |
126 | regex_search(str, rex); | |
127 | assert( i == 2 ); | |
128 | ||
129 | The action `++ref(i)` is queued three times: once for each found digit. But | |
130 | it is only /executed/ twice: once for each digit that precedes a `'!'` | |
131 | character. When the `'?'` character is encountered, the match algorithm | |
132 | backtracks, removing the final action from the queue. | |
133 | ||
134 | [h3 Immediate Action Execution] | |
135 | ||
136 | When you want semantic actions to execute immediately, you can wrap the | |
137 | sub-expression containing the action in a [^[funcref boost::xpressive::keep keep()]]. | |
138 | `keep()` turns off back-tracking for its sub-expression, but it also causes | |
139 | any actions queued by the sub-expression to execute at the end of the `keep()`. | |
140 | It is as if the sub-expression in the `keep()` were compiled into an | |
141 | independent regex object, and matching the `keep()` is like a separate invocation | |
142 | of `regex_search()`. It matches characters and executes actions but never backtracks | |
143 | or unwinds. For example, imagine the above example had been written as follows: | |
144 | ||
145 | int i = 0; | |
146 | std::string str("1!2!3?"); | |
147 | // count all the digits. | |
148 | sregex rex = +( keep( _d [ ++ref(i) ] ) >> '!' ); | |
149 | regex_search(str, rex); | |
150 | assert( i == 3 ); | |
151 | ||
152 | We have wrapped the sub-expression `_d [ ++ref(i) ]` in `keep()`. Now, whenever | |
153 | this regex matches a digit, the action will be queued and then immediately | |
154 | executed before we try to match a `'!'` character. In this case, the action | |
155 | executes three times. | |
156 | ||
157 | [note Like `keep()`, actions within [^[funcref boost::xpressive::before before()]] | |
158 | and [^[funcref boost::xpressive::after after()]] are also executed early when their | |
159 | sub-expressions have matched.] | |
160 | ||
161 | [h3 Lazy Functions] | |
162 | ||
163 | So far, we've seen how to write semantic actions consisting of variables and | |
164 | operators. But what if you want to be able to call a function from a semantic | |
165 | action? Xpressive provides a mechanism to do this. | |
166 | ||
167 | The first step is to define a function object type. Here, for instance, is a | |
168 | function object type that calls `push()` on its argument: | |
169 | ||
170 | struct push_impl | |
171 | { | |
172 | // Result type, needed for tr1::result_of | |
173 | typedef void result_type; | |
174 | ||
175 | template<typename Sequence, typename Value> | |
176 | void operator()(Sequence &seq, Value const &val) const | |
177 | { | |
178 | seq.push(val); | |
179 | } | |
180 | }; | |
181 | ||
182 | The next step is to use xpressive's `function<>` template to define a function | |
183 | object named `push`: | |
184 | ||
185 | // Global "push" function object. | |
186 | function<push_impl>::type const push = {{}}; | |
187 | ||
188 | The initialization looks a bit odd, but this is because `push` is being | |
189 | statically initialized. That means it doesn't need to be constructed | |
190 | at runtime. We can use `push` in semantic actions as follows: | |
191 | ||
192 | std::stack<int> ints; | |
193 | // Match digits, cast them to an int | |
194 | // and push it on the stack. | |
195 | sregex rex = (+_d)[push(ref(ints), as<int>(_))]; | |
196 | ||
197 | You'll notice that doing it this way causes member function invocations | |
198 | to look like ordinary function invocations. You can choose to write your | |
199 | semantic action in a different way that makes it look a bit more like | |
200 | a member function call: | |
201 | ||
202 | sregex rex = (+_d)[ref(ints)->*push(as<int>(_))]; | |
203 | ||
204 | Xpressive recognizes the use of the `->*` and treats this expression | |
205 | exactly the same as the one above. | |
206 | ||
207 | When your function object must return a type that depends on its | |
208 | arguments, you can use a `result<>` member template instead of the | |
209 | `result_type` typedef. Here, for example, is a `first` function object | |
210 | that returns the `first` member of a `std::pair<>` or _sub_match_: | |
211 | ||
212 | // Function object that returns the | |
213 | // first element of a pair. | |
214 | struct first_impl | |
215 | { | |
216 | template<typename Sig> struct result {}; | |
217 | ||
218 | template<typename This, typename Pair> | |
219 | struct result<This(Pair)> | |
220 | { | |
221 | typedef typename boost::remove_reference<Pair> | |
222 | ::type::first_type type; | |
223 | }; | |
224 | ||
225 | template<typename Pair> | |
226 | typename Pair::first_type | |
227 | operator()(Pair const &p) const | |
228 | { | |
229 | return p.first; | |
230 | } | |
231 | }; | |
232 | ||
233 | // OK, use as first(s1) to get the begin iterator | |
234 | // of the sub-match referred to by s1. | |
235 | function<first_impl>::type const first = {{}}; | |
236 | ||
237 | [h3 Referring to Local Variables] | |
238 | ||
239 | As we've seen in the examples above, we can refer to local variables within | |
240 | an action using `xpressive::ref()`. Any such variables are held by reference | |
241 | by the regular expression, and care should be taken to avoid letting those | |
242 | references dangle. For instance, in the following code, the reference to `i` | |
243 | is left to dangle when `bad_voodoo()` returns: | |
244 | ||
245 | sregex bad_voodoo() | |
246 | { | |
247 | int i = 0; | |
248 | sregex rex = +( _d [ ++ref(i) ] >> '!' ); | |
249 | // ERROR! rex refers by reference to a local | |
250 | // variable, which will dangle after bad_voodoo() | |
251 | // returns. | |
252 | return rex; | |
253 | } | |
254 | ||
255 | When writing semantic actions, it is your responsibility to make sure that | |
256 | all the references do not dangle. One way to do that would be to make the | |
257 | variables shared pointers that are held by the regex by value. | |
258 | ||
259 | sregex good_voodoo(boost::shared_ptr<int> pi) | |
260 | { | |
261 | // Use val() to hold the shared_ptr by value: | |
262 | sregex rex = +( _d [ ++*val(pi) ] >> '!' ); | |
263 | // OK, rex holds a reference count to the integer. | |
264 | return rex; | |
265 | } | |
266 | ||
267 | In the above code, we use `xpressive::val()` to hold the shared pointer by | |
268 | value. That's not normally necessary because local variables appearing in | |
269 | actions are held by value by default, but in this case, it is necessary. Had | |
270 | we written the action as `++*pi`, it would have executed immediately. That's | |
271 | because `++*pi` is not an expression template, but `++*val(pi)` is. | |
272 | ||
273 | It can be tedious to wrap all your variables in `ref()` and `val()` in your | |
274 | semantic actions. Xpressive provides the `reference<>` and `value<>` templates | |
275 | to make things easier. The following table shows the equivalencies: | |
276 | ||
277 | [table reference<> and value<> | |
278 | [[This ...][... is equivalent to this ...]] | |
279 | [[``int i = 0; | |
280 | ||
281 | sregex rex = +( _d [ ++ref(i) ] >> '!' );``][``int i = 0; | |
282 | reference<int> ri(i); | |
283 | sregex rex = +( _d [ ++ri ] >> '!' );``]] | |
284 | [[``boost::shared_ptr<int> pi(new int(0)); | |
285 | ||
286 | sregex rex = +( _d [ ++*val(pi) ] >> '!' );``][``boost::shared_ptr<int> pi(new int(0)); | |
287 | value<boost::shared_ptr<int> > vpi(pi); | |
288 | sregex rex = +( _d [ ++*vpi ] >> '!' );``]] | |
289 | ] | |
290 | ||
291 | As you can see, when using `reference<>`, you need to first declare a local | |
292 | variable and then declare a `reference<>` to it. These two steps can be combined | |
293 | into one using `local<>`. | |
294 | ||
295 | [table local<> vs. reference<> | |
296 | [[This ...][... is equivalent to this ...]] | |
297 | [[``local<int> i(0); | |
298 | ||
299 | sregex rex = +( _d [ ++i ] >> '!' );``][``int i = 0; | |
300 | reference<int> ri(i); | |
301 | sregex rex = +( _d [ ++ri ] >> '!' );``]] | |
302 | ] | |
303 | ||
304 | We can use `local<>` to rewrite the above example as follows: | |
305 | ||
306 | local<int> i(0); | |
307 | std::string str("1!2!3?"); | |
308 | // count the exciting digits, but not the | |
309 | // questionable ones. | |
310 | sregex rex = +( _d [ ++i ] >> '!' ); | |
311 | regex_search(str, rex); | |
312 | assert( i.get() == 2 ); | |
313 | ||
314 | Notice that we use `local<>::get()` to access the value of the local | |
315 | variable. Also, beware that `local<>` can be used to create a dangling | |
316 | reference, just as `reference<>` can. | |
317 | ||
318 | [h3 Referring to Non-Local Variables] | |
319 | ||
320 | In the beginning of this | |
321 | section, we used a regex with a semantic action to parse a string of | |
322 | word/integer pairs and stuff them into a `std::map<>`. That required that | |
323 | the map and the regex be defined together and used before either could | |
324 | go out of scope. What if we wanted to define the regex once and use it | |
325 | to fill lots of different maps? We would prefer to pass the map into the | |
326 | _regex_match_ algorithm rather than embed a reference to it directly in | |
327 | the regex object. What we can do instead is define a placeholder and use | |
328 | that in the semantic action instead of the map itself. Later, when we | |
329 | call one of the regex algorithms, we can bind the reference to an actual | |
330 | map object. The following code shows how. | |
331 | ||
332 | // Define a placeholder for a map object: | |
333 | placeholder<std::map<std::string, int> > _map; | |
334 | ||
335 | // Match a word and an integer, separated by =>, | |
336 | // and then stuff the result into a std::map<> | |
337 | sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) ) | |
338 | [ _map[s1] = as<int>(s2) ]; | |
339 | ||
340 | // Match one or more word/integer pairs, separated | |
341 | // by whitespace. | |
342 | sregex rx = pair >> *(+_s >> pair); | |
343 | ||
344 | // The string to parse | |
345 | std::string str("aaa=>1 bbb=>23 ccc=>456"); | |
346 | ||
347 | // Here is the actual map to fill in: | |
348 | std::map<std::string, int> result; | |
349 | ||
350 | // Bind the _map placeholder to the actual map | |
351 | smatch what; | |
352 | what.let( _map = result ); | |
353 | ||
354 | // Execute the match and fill in result map | |
355 | if(regex_match(str, what, rx)) | |
356 | { | |
357 | std::cout << result["aaa"] << '\n'; | |
358 | std::cout << result["bbb"] << '\n'; | |
359 | std::cout << result["ccc"] << '\n'; | |
360 | } | |
361 | ||
362 | This program displays: | |
363 | ||
364 | [pre | |
365 | 1 | |
366 | 23 | |
367 | 456 | |
368 | ] | |
369 | ||
370 | We use `placeholder<>` here to define `_map`, which stands in for a | |
371 | `std::map<>` variable. We can use the placeholder in the semantic action as if | |
372 | it were a map. Then, we define a _match_results_ struct and bind an actual map | |
373 | to the placeholder with "`what.let( _map = result );`". The _regex_match_ call | |
374 | behaves as if the placeholder in the semantic action had been replaced with a | |
375 | reference to `result`. | |
376 | ||
377 | [note Placeholders in semantic actions are not /actually/ replaced at runtime | |
378 | with references to variables. The regex object is never mutated in any way | |
379 | during any of the regex algorithms, so they are safe to use in multiple | |
380 | threads.] | |
381 | ||
382 | The syntax for late-bound action arguments is a little different if you are | |
383 | using _regex_iterator_ or _regex_token_iterator_. The regex iterators accept | |
384 | an extra constructor parameter for specifying the argument bindings. There is | |
385 | a `let()` function that you can use to bind variables to their placeholders. | |
386 | The following code demonstrates how. | |
387 | ||
388 | // Define a placeholder for a map object: | |
389 | placeholder<std::map<std::string, int> > _map; | |
390 | ||
391 | // Match a word and an integer, separated by =>, | |
392 | // and then stuff the result into a std::map<> | |
393 | sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) ) | |
394 | [ _map[s1] = as<int>(s2) ]; | |
395 | ||
396 | // The string to parse | |
397 | std::string str("aaa=>1 bbb=>23 ccc=>456"); | |
398 | ||
399 | // Here is the actual map to fill in: | |
400 | std::map<std::string, int> result; | |
401 | ||
402 | // Create a regex_iterator to find all the matches | |
403 | sregex_iterator it(str.begin(), str.end(), pair, let(_map=result)); | |
404 | sregex_iterator end; | |
405 | ||
406 | // step through all the matches, and fill in | |
407 | // the result map | |
408 | while(it != end) | |
409 | ++it; | |
410 | ||
411 | std::cout << result["aaa"] << '\n'; | |
412 | std::cout << result["bbb"] << '\n'; | |
413 | std::cout << result["ccc"] << '\n'; | |
414 | ||
415 | This program displays: | |
416 | ||
417 | [pre | |
418 | 1 | |
419 | 23 | |
420 | 456 | |
421 | ] | |
422 | ||
423 | [h2 User-Defined Assertions] | |
424 | ||
425 | You are probably already familiar with regular expression /assertions/. In | |
426 | Perl, some examples are the [^^] and [^$] assertions, which you can use to | |
427 | match the beginning and end of a string, respectively. Xpressive lets you | |
428 | define your own assertions. A custom assertion is a condition which must be | |
429 | true at a point in the match in order for the match to succeed. You can check | |
430 | a custom assertion with xpressive's _check_ function. | |
431 | ||
432 | There are a couple of ways to define a custom assertion. The simplest is to | |
433 | use a function object. Let's say that you want to ensure that a sub-expression | |
434 | matches a sub-string that is either 3 or 6 characters long. The following | |
435 | struct defines such a predicate: | |
436 | ||
437 | // A predicate that is true IFF a sub-match is | |
438 | // either 3 or 6 characters long. | |
439 | struct three_or_six | |
440 | { | |
441 | bool operator()(ssub_match const &sub) const | |
442 | { | |
443 | return sub.length() == 3 || sub.length() == 6; | |
444 | } | |
445 | }; | |
446 | ||
447 | You can use this predicate within a regular expression as follows: | |
448 | ||
449 | // match words of 3 characters or 6 characters. | |
450 | sregex rx = (bow >> +_w >> eow)[ check(three_or_six()) ] ; | |
451 | ||
452 | The above regular expression will find whole words that are either 3 or 6 | |
453 | characters long. The `three_or_six` predicate accepts a _sub_match_ that refers | |
454 | back to the part of the string matched by the sub-expression to which the | |
455 | custom assertion is attached. | |
456 | ||
457 | [note The custom assertion participates in determining whether the match | |
458 | succeeds or fails. Unlike actions, which execute lazily, custom assertions | |
459 | execute immediately while the regex engine is searching for a match.] | |
460 | ||
461 | Custom assertions can also be defined inline using the same syntax as for | |
462 | semantic actions. Below is the same custom assertion written inline: | |
463 | ||
464 | // match words of 3 characters or 6 characters. | |
465 | sregex rx = (bow >> +_w >> eow)[ check(length(_)==3 || length(_)==6) ] ; | |
466 | ||
467 | In the above, `length()` is a lazy function that calls the `length()` member | |
468 | function of its argument, and `_` is a placeholder that receives the | |
469 | `sub_match`. | |
470 | ||
471 | Once you get the hang of writing custom assertions inline, they can be | |
472 | very powerful. For example, you can write a regular expression that | |
473 | only matches valid dates (for some suitably liberal definition of the | |
474 | term ["valid]). | |
475 | ||
476 | int const days_per_month[] = | |
477 | {31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31}; | |
478 | ||
479 | mark_tag month(1), day(2); | |
480 | // find a valid date of the form month/day/year. | |
481 | sregex date = | |
482 | ( | |
483 | // Month must be between 1 and 12 inclusive | |
484 | (month= _d >> !_d) [ check(as<int>(_) >= 1 | |
485 | && as<int>(_) <= 12) ] | |
486 | >> '/' | |
487 | // Day must be between 1 and 31 inclusive | |
488 | >> (day= _d >> !_d) [ check(as<int>(_) >= 1 | |
489 | && as<int>(_) <= 31) ] | |
490 | >> '/' | |
491 | // Only consider years between 1970 and 2038 | |
492 | >> (_d >> _d >> _d >> _d) [ check(as<int>(_) >= 1970 | |
493 | && as<int>(_) <= 2038) ] | |
494 | ) | |
495 | // Ensure the month actually has that many days! | |
496 | [ check( ref(days_per_month)[as<int>(month)-1] >= as<int>(day) ) ] | |
497 | ; | |
498 | ||
499 | smatch what; | |
500 | std::string str("99/99/9999 2/30/2006 2/28/2006"); | |
501 | ||
502 | if(regex_search(str, what, date)) | |
503 | { | |
504 | std::cout << what[0] << std::endl; | |
505 | } | |
506 | ||
507 | The above program prints out the following: | |
508 | ||
509 | [pre | |
510 | 2/28/2006 | |
511 | ] | |
512 | ||
513 | Notice how the inline custom assertions are used to range-check the values for | |
514 | the month, day and year. The regular expression doesn't match `"99/99/9999"` or | |
515 | `"2/30/2006"` because they are not valid dates. (There is no 99th month, and | |
516 | February doesn't have 30 days.) | |
517 | ||
518 | [endsect] |