ceph/src/boost/libs/xpressive/doc/tokenization.qbk

[/
 / Copyright (c) 2008 Eric Niebler
 /
 / Distributed under the Boost Software License, Version 1.0. (See accompanying
 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 /]

[section String Splitting and Tokenization]

_regex_token_iterator_ is the Ginsu knife of the text manipulation world. It slices! It dices! This section describes
how to use the highly-configurable _regex_token_iterator_ to chop up input sequences.

[h2 Overview]

You initialize a _regex_token_iterator_ with an input sequence, a regex, and some optional configuration parameters.
The _regex_token_iterator_ will use _regex_search_ to find the first place in the sequence that the regex matches. When
dereferenced, the _regex_token_iterator_ returns a ['token] in the form of a `sub_match<>`, which converts implicitly to a
`std::basic_string<>`. Which token it returns depends on the configuration parameters. By default it returns the text of
the full match, but it could also return the text of a particular marked sub-expression, or even the part of the sequence
that ['didn't] match. When you increment the _regex_token_iterator_, it will move to the next token. Which token is next
depends on the configuration parameters. It could simply be a different marked sub-expression in the current match, or it
could be part or all of the next match. Or it could be the part that ['didn't] match.
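
The dereference-and-increment cycle described above can be sketched as an explicit loop. This is a minimal sketch, not
xpressive code: it assumes the C++11 standard library's `std::sregex_token_iterator` as a stand-in so that it builds
without Boost, on the assumption that the selector values used in this section (`0`, a positive [^['N]], and `-1`) mean
the same thing there. The helper name `collect_tokens` is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: gather every token the iterator produces.
// selector:  0 = whole match, N > 0 = N-th marked sub-expression,
//           -1 = the stretches of input between matches.
// Uses std::regex as a stand-in for xpressive (assumption, see text).
std::vector<std::string> collect_tokens(std::string const &input,
                                        std::regex const &re,
                                        int selector = 0)
{
    assert(selector >= -1); // smaller values are not meaningful selectors

    std::vector<std::string> tokens;
    std::sregex_token_iterator it(input.begin(), input.end(), re, selector), end;
    for (; it != end; ++it)          // incrementing moves to the next token
    {
        tokens.push_back(it->str()); // dereferencing yields a sub_match
    }
    return tokens;
}
```

Calling `collect_tokens("This is his face", std::regex("\\w+"))` collects the four words, and passing
`std::regex("\\s+")` with a selector of `-1` collects the same four words by treating white space as the delimiter.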

As you can see, _regex_token_iterator_ can do a lot. That makes it hard to describe, but some examples should make it clear.

[h2 Example 1: Simple Tokenization]

This example uses _regex_token_iterator_ to chop a sequence into a series of tokens consisting of words.

    std::string input("This is his face");
    sregex re = +_w;                      // find a word

    // iterate over all the words in the input
    sregex_token_iterator begin( input.begin(), input.end(), re ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
This
is
his
face
]

[h2 Example 2: Simple Tokenization, Reloaded]

This example also uses _regex_token_iterator_ to chop a sequence into a series of tokens consisting of words,
but it uses the regex as a delimiter. When we pass a `-1` as the last parameter to the _regex_token_iterator_
constructor, it instructs the token iterator to consider as tokens those parts of the input that ['didn't]
match the regex.

    std::string input("This is his face");
    sregex re = +_s;                      // find white space

    // iterate over all non-white space in the input. Note the -1 below:
    sregex_token_iterator begin( input.begin(), input.end(), re, -1 ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
This
is
his
face
]
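
The delimiter regex need not be plain white space. As a hedged illustration of the same `-1` technique, the sketch
below splits a comma-separated line and lets the delimiter swallow any white space around each comma. It again
assumes the standard library's `std::sregex_token_iterator` as a stand-in for the xpressive iterator, and the function
name `split_fields` is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `line` on commas, discarding any white
// space that surrounds each comma. Uses std::regex as a stand-in for
// xpressive (assumption, see text).
std::vector<std::string> split_fields(std::string const &line)
{
    static std::regex const delimiter("\\s*,\\s*");

    // -1 selects the parts of the input that did NOT match the delimiter.
    std::sregex_token_iterator first(line.begin(), line.end(), delimiter, -1), last;

    // Each sub_match converts to std::string as the vector is filled.
    return std::vector<std::string>(first, last);
}
```

For the input `"alpha , beta,gamma"` this yields the three fields `alpha`, `beta` and `gamma`.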

[h2 Example 3: Simple Tokenization, Revolutions]

This example uses _regex_token_iterator_ to chop a sequence containing a bunch of dates into a series of
tokens consisting of just the years. When we pass a positive integer [^['N]] as the last parameter to the
_regex_token_iterator_ constructor, it instructs the token iterator to consider as tokens only the [^['N]]-th
marked sub-expression of each match.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over all the years in the input. Note the 3 below, corresponding to the 3rd sub-expression:
    sregex_token_iterator begin( input.begin(), input.end(), re, 3 ), end;

    // write all the years to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
2003
1999
1981
]

[h2 Example 4: Not-So-Simple Tokenization]

This example is like the previous one, except that instead of tokenizing just the years, this program
turns the days, months and years into tokens. When we pass an array of integers [^['{I,J,...}]] as the last
parameter to the _regex_token_iterator_ constructor, it instructs the token iterator to consider as tokens the
[^['I]]-th, [^['J]]-th, etc. marked sub-expression of each match.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over the days, months and years in the input
    int const sub_matches[] = { 2, 1, 3 }; // day, month, year
    sregex_token_iterator begin( input.begin(), input.end(), re, sub_matches ), end;

    // write all the days, months and years to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
02
01
2003
23
04
1999
13
11
1981
]

The `sub_matches` array instructs the _regex_token_iterator_ to first take the value of the 2nd sub-match, then
the 1st sub-match, and finally the 3rd. Incrementing the iterator again instructs it to use _regex_search_ again
to find the next match. At that point, the process repeats -- the token iterator takes the value of the 2nd
sub-match, then the 1st, et cetera.
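
Because each match contributes exactly three tokens in a fixed order, the flat token stream can be regrouped into one
record per date. The sketch below shows one way to do that. It is an illustration under stated assumptions rather
than xpressive code: the standard library's `std::sregex_token_iterator` stands in for the xpressive iterator so the
sketch builds without Boost, and the `date` struct and `extract_dates` function are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record type for one date found in the input.
struct date
{
    std::string day, month, year;
};

// Hypothetical helper: regroup the token stream into one record per match.
// Uses std::regex as a stand-in for xpressive (assumption, see text).
std::vector<date> extract_dates(std::string const &input)
{
    static std::regex const re("(\\d{2})/(\\d{2})/(\\d{4})"); // month/day/year
    int const sub_matches[] = { 2, 1, 3 };                    // day, month, year

    std::sregex_token_iterator it(input.begin(), input.end(), re, sub_matches), end;

    std::vector<date> dates;
    while (it != end)
    {
        date d;
        d.day = it->str();   ++it; // 2nd sub-match of the current match
        assert(it != end);         // every match yields all three tokens
        d.month = it->str(); ++it; // 1st sub-match
        assert(it != end);
        d.year = it->str();  ++it; // 3rd sub-match; next ++ starts a new match
        dates.push_back(d);
    }
    return dates;
}
```

Applied to the input string from the example above, `extract_dates` returns three records, the first holding day
`02`, month `01` and year `2003`.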

[endsect]