ceph/src/boost/libs/xpressive/doc/grammars.qbk

   1 [/
   2  / Copyright (c) 2008 Eric Niebler
   3  /
   4  / Distributed under the Boost Software License, Version 1.0. (See accompanying
   5  / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
   6  /]
   7
   8 [section Grammars and Nested Matches]
   9
  10 [h2 Overview]
  11
  12 One of the key benefits of representing regexes as C++ expressions is the ability to easily refer to other C++
  13 code and data from within the regex. This enables programming idioms that are not possible with other regular
  14 expression libraries. Of particular note is the ability for one regex to refer to another regex, allowing you
  15 to build grammars out of regular expressions. This section describes how to embed one regex in another by value
  16 and by reference, how regex objects behave when they refer to other regexes, and how to access the tree of results
  17 after a successful parse.
  18
  19 [h2 Embedding a Regex by Value]
  20
  21 The _basic_regex_ object has value semantics. When a regex object appears on the right-hand side in the definition
  22 of another regex, it is as if the regex were embedded by value; that is, a copy of the nested regex is stored by
  23 the enclosing regex. The inner regex is invoked by the outer regex during pattern matching. The inner regex
  24 participates fully in the match, back-tracking as needed to make the match succeed.
  25
  26 Consider a text editor that has a regex-find feature with a whole-word option. You can implement this with
  27 xpressive as follows:
  28
  29     find_dialog dlg;
  30     if( dialog_ok == dlg.do_modal() )
  31     {
  32         std::string pattern = dlg.get_text();          // the pattern the user entered
  33         bool whole_word = dlg.whole_word.is_checked(); // did the user select the whole-word option?
  34
  35         sregex re = sregex::compile( pattern );        // try to compile the pattern
  36
  37         if( whole_word )
  38         {
  39             // wrap the regex in begin-word / end-word assertions
  40             re = bow >> re >> eow;
  41         }
  42
  43         // ... use re ...
  44     }
  45
  46 Look closely at this line:
  47
  48     // wrap the regex in begin-word / end-word assertions
  49     re = bow >> re >> eow;
  50
  51 This line creates a new regex that embeds the old regex by value. Then, the new regex is assigned back to
  52 the original regex. Since a copy of the old regex was made on the right-hand side, this works as you might
  53 expect: the new regex has the behavior of the old regex wrapped in begin- and end-word assertions.
  54
  55 [note Note that `re = bow >> re >> eow` does ['not] define a recursive regular expression, since regex
  56 objects embed by value by default. The next section shows how to define a recursive regular expression by
  57 embedding a regex by reference.]
  58
  59 [h2 Embedding a Regex by Reference]
  60
  61 If you want to be able to build recursive regular expressions and context-free grammars, embedding a regex
  62 by value is not enough. You need to be able to make your regular expressions self-referential. Most regular
  63 expression engines don't give you that power, but xpressive does.
  64
  65 [tip The theoretical computer scientists out there will correctly point out that a self-referential
  66 regular expression is not "regular", so in the strict sense, xpressive isn't really a ['regular] expression engine
  67 at all. But as Larry Wall once said, "the term '''[regular expression]''' has grown with the capabilities of our
  68 pattern matching engines, so I'm not going to try to fight linguistic necessity here."]
  69
  70 Consider the following code, which uses the `by_ref()` helper to define a recursive regular expression that
  71 matches balanced, nested parentheses:
  72
  73     sregex parentheses;
  74     parentheses                          // A balanced set of parentheses ...
  75         = '('                            // is an opening parenthesis ...
  76             >>                           // followed by ...
  77              *(                          // zero or more ...
  78                 keep( +~(set='(',')') )  // of a bunch of things that are not parentheses ...
  79               |                          // or ...
  80                 by_ref(parentheses)      // a balanced set of parentheses
  81               )                          //   (ooh, recursion!) ...
  82             >>                           // followed by ...
  83           ')'                            // a closing parenthesis
  84         ;
  85
  86 Matching balanced, nested tags is an important text processing task, and it is one that "classic" regular
  87 expressions cannot do. The `by_ref()` helper makes it possible. It allows one regex object to be embedded
  88 in another ['by reference]. Since the right-hand side holds `parentheses` by reference, assigning the right-hand
  89 side back to `parentheses` creates a cycle, which will execute recursively.
  90
  91 [h2 Building a Grammar]
  92
  93 Once we allow self-reference in our regular expressions, the genie is out of the bottle and all manner of
  94 fun things are possible. In particular, we can now build grammars out of regular expressions. Let's have
  95 a look at the text-book grammar example: the humble calculator.
  96
  97     sregex group, factor, term, expression;
  98
  99     group       = '(' >> by_ref(expression) >> ')';
 100     factor      = +_d | group;
 101     term        = factor >> *(('*' >> factor) | ('/' >> factor));
 102     expression  = term >> *(('+' >> term) | ('-' >> term));
 103
 104 The regex `expression` defined above does something rather remarkable for a regular expression: it matches
 105 mathematical expressions. For example, if the input string were `"foo 9*(10+3) bar"`, this pattern would
 106 match `"9*(10+3)"`. It only matches well-formed mathematical expressions, where the parentheses are
 107 balanced and the infix operators have two arguments each. Don't try this with just any regular expression
 108 engine!
 109
 110 Let's take a closer look at this regular expression grammar. Notice that it is cyclic: `expression` is
 111 implemented in terms of `term`, which is implemented in terms of `factor`, which is implemented in terms
 112 of `group`, which is implemented in terms of `expression`, closing the loop. In general, the way to define
 113 a cyclic grammar is to forward-declare the regex objects and embed by reference those regular expressions
 114 that have not yet been initialized. In the above grammar, there is only one place where we need to reference
 115 a regex object that has not yet been initialized: the definition of `group`. In that place, we use
 116 `by_ref()` to embed `expression` by reference. In all other places, it is sufficient to embed the other
 117 regex objects by value, since they have already been initialized and their values will not change.
 118
 119 [tip [*Embed by value if possible]
 120 \n\n
 121 In general, prefer embedding regular expressions by value rather than by reference. It
 122 involves one less indirection, making your patterns match a little faster. Besides, value semantics
 123 are simpler and will make your grammars easier to reason about. Don't worry about the expense of "copying"
 124 a regex. Each regex object shares its implementation with all of its copies.]
 125
 126 [h2 Dynamic Regex Grammars]
 127
 128 Using _regex_compiler_, you can also build grammars out of dynamic regular expressions.
 129 You do that by creating named regexes, and referring to other regexes by name. Each
 130 _regex_compiler_ instance keeps a mapping from names to regexes that have been created
 131 with it.
 132
 133 You can create a named dynamic regex by prefacing your regex with `"(?$name=)"`, where
 134 /name/ is the name of the regex. You can refer to a named regex from another regex with
 135 `"(?$name)"`. The named regex does not need to exist yet at the time it is referenced
 136 in another regex, but it must exist by the time you use the regex.
 137
 138 Below is a code fragment that uses dynamic regex grammars to implement the calculator
 139 example from above.
 140
 141     using namespace boost::xpressive;
 142     using namespace regex_constants;
 143
 144     sregex expr;
 145
 146     {
 147          sregex_compiler compiler;
 148          syntax_option_type x = ignore_white_space;
 149
 150                 compiler.compile("(? $group  = ) \\( (? $expr ) \\) ", x);
 151                 compiler.compile("(? $factor = ) \\d+ | (? $group ) ", x);
 152                 compiler.compile("(? $term   = ) (? $factor )"
 153                                  " ( \\* (? $factor ) | / (? $factor ) )* ", x);
 154          expr = compiler.compile("(? $expr   = ) (? $term )"
 155                                  "   ( \\+ (? $term ) | - (? $term )   )* ", x);
 156     }
 157
 158     std::string str("foo 9*(10+3) bar");
 159     smatch what;
 160
 161     if(regex_search(str, what, expr))
 162     {
 163          // This prints "9*(10+3)":
 164          std::cout << what[0] << std::endl;
 165     }
 166
 167 As with static regex grammars, nested regex invocations create nested
 168 match results (see /Nested Results/ below). The result is a complete parse tree
 169 for string that matched. Unlike static regexes, dynamic regexes are always
 170 embedded by reference, not by value.
 171
 172 [h2 Cyclic Patterns, Copying and Memory Management, Oh My!]
 173
 174 The calculator examples above raises a number of very complicated memory-management issues. Each of the
 175 four regex objects refer to each other, some directly and some indirectly, some by value and some by
 176 reference. What if we were to return one of them from a function and let the others go out of scope?
 177 What becomes of the references? The answer is that the regex objects are internally reference counted,
 178 such that they keep their referenced regex objects alive as long as they need them. So passing a regex
 179 object by value is never a problem, even if it refers to other regex objects that have gone out of scope.
 180
 181 Those of you who have dealt with reference counting are probably familiar with its Achilles Heel: cyclic
 182 references. If regex objects are reference counted, what happens to cycles like the one created in the
 183 calculator examples? Are they leaked? The answer is no, they are not leaked. The _basic_regex_ object has some tricky
 184 reference tracking code that ensures that even cyclic regex grammars are cleaned up when the last external
 185 reference goes away. So don't worry about it. Create cyclic grammars, pass your regex objects around and
 186 copy them all you want. It is fast and efficient and guaranteed not to leak or result in dangling references.
 187
 188 [h2 Nested Regexes and Sub-Match Scoping]
 189
 190 Nested regular expressions raise the issue of sub-match scoping. If both the inner and outer regex write
 191 to and read from the same sub-match vector, chaos would ensue. The inner regex would stomp on the
 192 sub-matches written by the outer regex. For example, what does this do?
 193
 194     sregex inner = sregex::compile( "(.)\\1" );
 195     sregex outer = (s1= _) >> inner >> s1;
 196
 197 The author probably didn't intend for the inner regex to overwrite the sub-match written by the outer
 198 regex. The problem is particularly acute when the inner regex is accepted from the user as input. The
 199 author has no way of knowing whether the inner regex will stomp the sub-match vector or not. This is
 200 clearly not acceptable.
 201
 202 Instead, what actually happens is that each invocation of a nested regex gets its own scope. Sub-matches
 203 belong to that scope. That is, each nested regex invocation gets its own copy of the sub-match vector to
 204 play with, so there is no way for an inner regex to stomp on the sub-matches of an outer regex. So, for
 205 example, the regex `outer` defined above would match `"ABBA"`, as it should.
 206
 207 [h2 Nested Results]
 208
 209 If nested regexes have their own sub-matches, there should be a way to access them after a successful
 210 match. In fact, there is. After a _regex_match_ or _regex_search_, the _match_results_ struct behaves
 211 like the head of a tree of nested results. The _match_results_ class provides a `nested_results()`
 212 member function that returns an ordered sequence of _match_results_ structures, representing the
 213 results of the nested regexes. The order of the nested results is the same as the order in which
 214 the nested regex objects matched.
 215
 216 Take as an example the regex for balanced, nested parentheses we saw earlier:
 217
 218     sregex parentheses;
 219     parentheses = '(' >> *( keep( +~(set='(',')') ) | by_ref(parentheses) ) >> ')';
 220
 221     smatch what;
 222     std::string str( "blah blah( a(b)c (c(e)f (g)h )i (j)6 )blah" );
 223
 224     if( regex_search( str, what, parentheses ) )
 225     {
 226         // display the whole match
 227         std::cout << what[0] << '\n';
 228
 229         // display the nested results
 230         std::for_each(
 231             what.nested_results().begin(),
 232             what.nested_results().end(),
 233             output_nested_results() );
 234     }
 235
 236 This program displays the following:
 237
 238 [pre
 239 ( a(b)c (c(e)f (g)h )i (j)6 )
 240     (b)
 241     (c(e)f (g)h )
 242         (e)
 243         (g)
 244     (j)
 245 ]
 246
 247 Here you can see how the results are nested and that they are stored in the order in which they
 248 are found.
 249
 250 [tip See the definition of [link boost_xpressive.user_s_guide.examples.display_a_tree_of_nested_results output_nested_results] in the
 251 [link boost_xpressive.user_s_guide.examples Examples] section.]
 252
 253 [h2 Filtering Nested Results]
 254
 255 Sometimes a regex will have several nested regex objects, and you want to know which result corresponds
 256 to which regex object. That's where `basic_regex<>::regex_id()` and `match_results<>::regex_id()`
 257 come in handy. When iterating over the nested results, you can compare the regex id from the results to
 258 the id of the regex object you're interested in.
 259
 260 To make this a bit easier, xpressive provides a predicate to make it simple to iterate over just the
 261 results that correspond to a certain nested regex. It is called `regex_id_filter_predicate`, and it is
 262 intended to be used with _iterator_. You can use it as follows:
 263
 264     sregex name = +alpha;
 265     sregex integer = +_d;
 266     sregex re = *( *_s >> ( name | integer ) );
 267
 268     smatch what;
 269     std::string str( "marsha 123 jan 456 cindy 789" );
 270
 271     if( regex_match( str, what, re ) )
 272     {
 273         smatch::nested_results_type::const_iterator begin = what.nested_results().begin();
 274         smatch::nested_results_type::const_iterator end   = what.nested_results().end();
 275
 276         // declare filter predicates to select just the names or the integers
 277         sregex_id_filter_predicate name_id( name.regex_id() );
 278         sregex_id_filter_predicate integer_id( integer.regex_id() );
 279
 280         // iterate over only the results from the name regex
 281         std::for_each(
 282             boost::make_filter_iterator( name_id, begin, end ),
 283             boost::make_filter_iterator( name_id, end, end ),
 284             output_result
 285             );
 286
 287         std::cout << '\n';
 288
 289         // iterate over only the results from the integer regex
 290         std::for_each(
 291             boost::make_filter_iterator( integer_id, begin, end ),
 292             boost::make_filter_iterator( integer_id, end, end ),
 293             output_result
 294             );
 295     }
 296
 297 where `output_results` is a simple function that takes a `smatch` and displays the full match.
 298 Notice how we use the `regex_id_filter_predicate` together with `basic_regex<>::regex_id()` and
 299 `boost::make_filter_iterator()` from the _iterator_ to select only those results
 300 corresponding to a particular nested regex. This program displays the following:
 301
 302 [pre
 303 marsha
 304 jan
 305 cindy
 306 123
 307 456
 308 789
 309 ]
 310
 311
 312
 313 [endsect]