
[/ 
  Copyright 2006-2007 John Maddock.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]


[section:captures Understanding Marked Sub-Expressions and Captures]

Captures are the iterator ranges that are "captured" by marked 
sub-expressions as a regular expression gets matched.  Each marked 
sub-expression can result in more than one capture, if it is matched 
more than once.  This document explains how captures and marked 
sub-expressions in Boost.Regex are represented and accessed.

[h4 Marked sub-expressions]

Every pair of parentheses `()` in a Perl regular expression, other than 
a non-marking group such as `(?:...)`, produces an additional field known 
as a marked sub-expression.  For example, the expression:

[pre (\w+)\W+(\w+)]

has two marked sub-expressions (known as $1 and $2 respectively).  In 
addition, the complete match is known as $&, everything before the 
match as $\`, and everything after the match as $'.  So 
if the above expression is searched for within `"@abc def--"`, then we obtain:

[table
[[Sub-expression][Text found]]
[[$\`]["@"]]
[[$&]["abc def"]]
[[$1]["abc"]]
[[$2]["def"]]
[[$']["--"]]
]

In Boost.Regex all these are accessible via the [match_results] class that 
gets filled in when calling one of the regular expression matching algorithms 
([regex_search], [regex_match], or [regex_iterator]).  So given:

   boost::match_results<IteratorType> m;

The Perl and Boost.Regex equivalents are as follows:

[table 
[[Perl][Boost.Regex]]
[[$\`][`m.prefix()`]]
[[$&][`m[0]`]]
[[$n][`m[n]`]]
[[$\'][`m.suffix()`]]
]

In Boost.Regex each sub-expression match is represented by a [sub_match] object. 
This is essentially a pair of iterators denoting the start and end 
positions of the sub-expression match, with some additional 
operators provided so that objects of type [sub_match] behave a lot like a 
`std::basic_string`: for example, they are implicitly convertible to a 
`basic_string`, and they can be compared to a string, added to a string, or 
streamed out to an output stream.

[h4 Unmatched Sub-Expressions]

When a regular expression match is found, not all of the 
marked sub-expressions need to have participated in the match.  For example, the expression:

[pre (abc)|(def)]

can match with either $1 or $2 participating, but never both at the same time.  In Boost.Regex 
you can determine which sub-expressions matched by checking the 
`sub_match::matched` data member.

[h4 Repeated Captures]

When a marked sub-expression is repeated, the sub-expression is 
"captured" multiple times; normally, however, only the final capture is available.  
For example, if

[pre (?:(\w+)\W*)+]

is matched against

[pre one fine day]

then $1 will contain the string "day", and all the previous captures will have 
been forgotten.

However, Boost.Regex has an experimental feature that allows all the capture 
information to be retained.  It is accessed via either the 
`match_results::captures` member function or the `sub_match::captures` member 
function.  These functions return a container holding the sequence of all 
the captures obtained during the regular expression matching.  The following 
example program shows how this information may be used:

   #include <boost/regex.hpp>
   #include <iostream>
   #include <string>

   void print_captures(const std::string& regx, const std::string& text)
   {
      boost::regex e(regx);
      boost::smatch what;
      std::cout << "Expression:  \"" << regx << "\"\n";
      std::cout << "Text:        \"" << text << "\"\n";
      if(boost::regex_match(text, what, e, boost::match_extra))
      {
         unsigned i, j;
         std::cout << "** Match found **\n   Sub-Expressions:\n";
         for(i = 0; i < what.size(); ++i)
            std::cout << "      $" << i << " = \"" << what[i] << "\"\n";
         std::cout << "   Captures:\n";
         for(i = 0; i < what.size(); ++i)
         {
            std::cout << "      $" << i << " = {";
            for(j = 0; j < what.captures(i).size(); ++j)
            {
               if(j)
                  std::cout << ", ";
               else
                  std::cout << " ";
               std::cout << "\"" << what.captures(i)[j] << "\"";
            }
            std::cout << " }\n";
         }
      }
      else
      {
         std::cout << "** No Match found **\n";
      }
   }

   int main(int , char* [])
   {
      print_captures("(([[:lower:]]+)|([[:upper:]]+))+", "aBBcccDDDDDeeeeeeee");
      print_captures("(.*)bar|(.*)bah", "abcbar");
      print_captures("(.*)bar|(.*)bah", "abcbah");
      print_captures("^(?:(\\w+)|(?>\\W+))*$", 
         "now is the time for all good men to come to the aid of the party");
      return 0;
   }

This program produces the following output:

[pre 
Expression:  "((\[\[:lower:\]\]+)|(\[\[:upper:\]\]+))+"
Text:        "aBBcccDDDDDeeeeeeee"
'''**''' Match found '''**'''
   Sub-Expressions:
      $0 = "aBBcccDDDDDeeeeeeee"
      $1 = "eeeeeeee"
      $2 = "eeeeeeee"
      $3 = "DDDDD"
   Captures:
      $0 = { "aBBcccDDDDDeeeeeeee" }
      $1 = { "a", "BB", "ccc", "DDDDD", "eeeeeeee" }
      $2 = { "a", "ccc", "eeeeeeee" }
      $3 = { "BB", "DDDDD" }
Expression:  "(.'''*''')bar|(.'''*''')bah"
Text:        "abcbar"
'''**''' Match found '''**'''
   Sub-Expressions:
      $0 = "abcbar"
      $1 = "abc"
      $2 = ""
   Captures:
      $0 = { "abcbar" }
      $1 = { "abc" }
      $2 = { }
Expression:  "(.'''*''')bar|(.'''*''')bah"
Text:        "abcbah"
'''**''' Match found '''**'''
   Sub-Expressions:
      $0 = "abcbah"
      $1 = ""
      $2 = "abc"
   Captures:
      $0 = { "abcbah" }
      $1 = { }
      $2 = { "abc" }
Expression:  "^(?:(\w+)|(?>\W+))'''*$'''"
Text:        "now is the time for all good men to come to the aid of the party"
'''**''' Match found '''**'''
   Sub-Expressions:
      $0 = "now is the time for all good men to come to the aid of the party"
      $1 = "party"
   Captures:
      $0 = { "now is the time for all good men to come to the aid of the party" }
      $1 = { "now", "is", "the", "time", "for", "all", "good", "men", "to", 
         "come", "to", "the", "aid", "of", "the", "party" }
]

Unfortunately, enabling this feature has an impact on performance 
even if you don't use it, and a much bigger impact if you do.  
Therefore, to use this feature you need to:

* Define `BOOST_REGEX_MATCH_EXTRA` for all translation units, including the library source (the best way to do this is to uncomment this define in `boost/regex/user.hpp` and then rebuild everything).
* Pass the `match_extra` flag to the particular algorithms where you actually need the captures information (`regex_search`, `regex_match`, or `regex_iterator`).
    
[endsect]