Commit | Line | Data |
---|---|---|
1 | <html> | |
2 | <head> | |
3 | <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII"> | |
4 | <title>Understanding Marked Sub-Expressions and Captures</title> | |
5 | <link rel="stylesheet" href="../../../../../doc/src/boostbook.css" type="text/css"> | |
6 | <meta name="generator" content="DocBook XSL Stylesheets V1.77.1"> | |
7 | <link rel="home" href="../index.html" title="Boost.Regex 5.1.2"> | |
8 | <link rel="up" href="../index.html" title="Boost.Regex 5.1.2"> | |
9 | <link rel="prev" href="unicode.html" title="Unicode and Boost.Regex"> | |
10 | <link rel="next" href="partial_matches.html" title="Partial Matches"> | |
11 | </head> | |
12 | <body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"> | |
13 | <table cellpadding="2" width="100%"><tr> | |
14 | <td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../boost.png"></td> | |
15 | <td align="center"><a href="../../../../../index.html">Home</a></td> | |
16 | <td align="center"><a href="../../../../../libs/libraries.htm">Libraries</a></td> | |
17 | <td align="center"><a href="http://www.boost.org/users/people.html">People</a></td> | |
18 | <td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td> | |
19 | <td align="center"><a href="../../../../../more/index.htm">More</a></td> | |
20 | </tr></table> | |
21 | <hr> | |
22 | <div class="spirit-nav"> | |
23 | <a accesskey="p" href="unicode.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="partial_matches.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a> | |
24 | </div> | |
25 | <div class="section"> | |
26 | <div class="titlepage"><div><div><h2 class="title" style="clear: both"> | |
27 | <a name="boost_regex.captures"></a><a class="link" href="captures.html" title="Understanding Marked Sub-Expressions and Captures">Understanding Marked Sub-Expressions | |
28 | and Captures</a> | |
29 | </h2></div></div></div> | |
30 | <p> | |
31 | Captures are the iterator ranges that are "captured" by marked sub-expressions | |
32 | as a regular expression gets matched. Each marked sub-expression can result | |
33 | in more than one capture, if it is matched more than once. This document explains | |
34 | how captures and marked sub-expressions in Boost.Regex are represented and | |
35 | accessed. | |
36 | </p> | |
37 | <h5> | |
38 | <a name="boost_regex.captures.h0"></a> | |
39 | <span class="phrase"><a name="boost_regex.captures.marked_sub_expressions"></a></span><a class="link" href="captures.html#boost_regex.captures.marked_sub_expressions">Marked | |
40 | sub-expressions</a> | |
41 | </h5> | |
42 | <p> | |
43 | Each parenthesized group <code class="computeroutput"><span class="special">()</span></code> in a Perl regular expression produces an extra field, known as a | |
44 | marked sub-expression. For example, the expression: | |
45 | </p> | |
46 | <pre class="programlisting">(\w+)\W+(\w+)</pre> | |
47 | <p> | |
48 | has two marked sub-expressions (known as $1 and $2 respectively). In addition, | |
49 | the complete match is known as $&, everything before the match as | |
50 | $`, and everything after the match as $'. So if the above expression is searched | |
51 | for within <code class="computeroutput"><span class="string">"@abc def--"</span></code>, | |
52 | then we obtain: | |
53 | </p> | |
54 | <div class="informaltable"><table class="table"> | |
55 | <colgroup> | |
56 | <col> | |
57 | <col> | |
58 | </colgroup> | |
59 | <thead><tr> | |
60 | <th> | |
61 | <p> | |
62 | Sub-expression | |
63 | </p> | |
64 | </th> | |
65 | <th> | |
66 | <p> | |
67 | Text found | |
68 | </p> | |
69 | </th> | |
70 | </tr></thead> | |
71 | <tbody> | |
72 | <tr> | |
73 | <td> | |
74 | <p> | |
75 | $` | |
76 | </p> | |
77 | </td> | |
78 | <td> | |
79 | <p> | |
80 | "@" | |
81 | </p> | |
82 | </td> | |
83 | </tr> | |
84 | <tr> | |
85 | <td> | |
86 | <p> | |
87 | $& | |
88 | </p> | |
89 | </td> | |
90 | <td> | |
91 | <p> | |
92 | "abc def" | |
93 | </p> | |
94 | </td> | |
95 | </tr> | |
96 | <tr> | |
97 | <td> | |
98 | <p> | |
99 | $1 | |
100 | </p> | |
101 | </td> | |
102 | <td> | |
103 | <p> | |
104 | "abc" | |
105 | </p> | |
106 | </td> | |
107 | </tr> | |
108 | <tr> | |
109 | <td> | |
110 | <p> | |
111 | $2 | |
112 | </p> | |
113 | </td> | |
114 | <td> | |
115 | <p> | |
116 | "def" | |
117 | </p> | |
118 | </td> | |
119 | </tr> | |
120 | <tr> | |
121 | <td> | |
122 | <p> | |
123 | $' | |
124 | </p> | |
125 | </td> | |
126 | <td> | |
127 | <p> | |
128 | "--" | |
129 | </p> | |
130 | </td> | |
131 | </tr> | |
132 | </tbody> | |
133 | </table></div> | |
134 | <p> | |
135 | In Boost.Regex all these are accessible via the <a class="link" href="ref/match_results.html" title="match_results"><code class="computeroutput"><span class="identifier">match_results</span></code></a> class that gets filled | |
136 | in when calling one of the regular expression matching algorithms (<a class="link" href="ref/regex_search.html" title="regex_search"><code class="computeroutput"><span class="identifier">regex_search</span></code></a>, <a class="link" href="ref/regex_match.html" title="regex_match"><code class="computeroutput"><span class="identifier">regex_match</span></code></a>, or <a class="link" href="ref/regex_iterator.html" title="regex_iterator"><code class="computeroutput"><span class="identifier">regex_iterator</span></code></a>). So given: | |
137 | </p> | |
138 | <pre class="programlisting"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">match_results</span><span class="special"><</span><span class="identifier">IteratorType</span><span class="special">></span> <span class="identifier">m</span><span class="special">;</span> | |
139 | </pre> | |
140 | <p> | |
141 | The Perl and Boost.Regex equivalents are as follows: | |
142 | </p> | |
143 | <div class="informaltable"><table class="table"> | |
144 | <colgroup> | |
145 | <col> | |
146 | <col> | |
147 | </colgroup> | |
148 | <thead><tr> | |
149 | <th> | |
150 | <p> | |
151 | Perl | |
152 | </p> | |
153 | </th> | |
154 | <th> | |
155 | <p> | |
156 | Boost.Regex | |
157 | </p> | |
158 | </th> | |
159 | </tr></thead> | |
160 | <tbody> | |
161 | <tr> | |
162 | <td> | |
163 | <p> | |
164 | $` | |
165 | </p> | |
166 | </td> | |
167 | <td> | |
168 | <p> | |
169 | <code class="computeroutput"><span class="identifier">m</span><span class="special">.</span><span class="identifier">prefix</span><span class="special">()</span></code> | |
170 | </p> | |
171 | </td> | |
172 | </tr> | |
173 | <tr> | |
174 | <td> | |
175 | <p> | |
176 | $& | |
177 | </p> | |
178 | </td> | |
179 | <td> | |
180 | <p> | |
181 | <code class="computeroutput"><span class="identifier">m</span><span class="special">[</span><span class="number">0</span><span class="special">]</span></code> | |
182 | </p> | |
183 | </td> | |
184 | </tr> | |
185 | <tr> | |
186 | <td> | |
187 | <p> | |
188 | $n | |
189 | </p> | |
190 | </td> | |
191 | <td> | |
192 | <p> | |
193 | <code class="computeroutput"><span class="identifier">m</span><span class="special">[</span><span class="identifier">n</span><span class="special">]</span></code> | |
194 | </p> | |
195 | </td> | |
196 | </tr> | |
197 | <tr> | |
198 | <td> | |
199 | <p> | |
200 | $' | |
201 | </p> | |
202 | </td> | |
203 | <td> | |
204 | <p> | |
205 | <code class="computeroutput"><span class="identifier">m</span><span class="special">.</span><span class="identifier">suffix</span><span class="special">()</span></code> | |
206 | </p> | |
207 | </td> | |
208 | </tr> | |
209 | </tbody> | |
210 | </table></div> | |
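<p>
The correspondence in the table above can be sketched as follows. This is a minimal illustration rather than library code: it uses the standard library's <code class="computeroutput">std::regex</code> and <code class="computeroutput">std::smatch</code> in place of their Boost counterparts (an assumption made so that the snippet has no extra dependency; only <code class="computeroutput">prefix()</code>, <code class="computeroutput">operator[]</code> and <code class="computeroutput">suffix()</code> from the table are used), and <code class="computeroutput">perl_style_fields</code> is a hypothetical helper name.
</p>

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of any library): searches `text` for
// (\w+)\W+(\w+) and returns the Perl-style fields in the order
// $`, $&, $1, $2, $'. Returns an empty vector if nothing matches.
// Assumption: std::regex stands in for boost::regex here.
std::vector<std::string> perl_style_fields(const std::string& text)
{
    static const std::regex e(R"((\w+)\W+(\w+))");
    std::smatch m;
    std::vector<std::string> fields;
    if (std::regex_search(text, m, e))
    {
        fields.push_back(m.prefix().str());  // $`  everything before the match
        fields.push_back(m[0].str());        // $&  the complete match
        fields.push_back(m[1].str());        // $1  first marked sub-expression
        fields.push_back(m[2].str());        // $2  second marked sub-expression
        fields.push_back(m.suffix().str());  // $'  everything after the match
    }
    return fields;
}
```

<p>
For the text <code class="computeroutput">"@abc def--"</code> the returned fields are "@", "abc def", "abc", "def" and "--", in that order, matching the first table.
</p>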
211 | <p> | |
212 | In Boost.Regex each sub-expression match is represented by a <a class="link" href="ref/sub_match.html" title="sub_match"><code class="computeroutput"><span class="identifier">sub_match</span></code></a> object. This is essentially | |
213 | a pair of iterators denoting the start and end position of the sub-expression | |
214 | match, but some additional operators are provided so that objects of | |
215 | type <a class="link" href="ref/sub_match.html" title="sub_match"><code class="computeroutput"><span class="identifier">sub_match</span></code></a> | |
216 | behave a lot like a <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">basic_string</span></code>: for example, they are implicitly | |
217 | convertible to a <code class="computeroutput"><span class="identifier">basic_string</span></code>, | |
218 | and they can be compared to a string, added to a string, or streamed out to an | |
219 | output stream. | |
220 | </p> | |
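<p>
A short sketch of this string-like behaviour is given below. It is written against the standard library's <code class="computeroutput">std::ssub_match</code> rather than the Boost type (an assumption, chosen to keep the snippet dependency-free), so it shows only conversion, comparison and streaming; addition to a string is not shown. The function name <code class="computeroutput">describe_first_word</code> and its <code class="computeroutput">expected</code> parameter are hypothetical.
</p>

```cpp
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: finds the first word in `text` and describes it using
// only the string-like operations of sub_match.
// Assumption: std::regex stands in for boost::regex here.
std::string describe_first_word(const std::string& text,
                                const std::string& expected)
{
    static const std::regex e(R"(\w+)");
    std::smatch m;
    if (!std::regex_search(text, m, e))
        return "no match";

    const std::ssub_match& word = m[0];

    // Implicit conversion to a std::string.
    const std::string copy = word;

    std::ostringstream out;
    out << word                                  // streamed like a string
        << " has " << copy.size() << " characters and "
        << (word == expected ? "equals " : "differs from ")  // compared to a string
        << expected;
    return out.str();
}
```

<p>
With the text <code class="computeroutput">"@abc def--"</code> and an expected value of <code class="computeroutput">"abc"</code>, the result is the string <code class="computeroutput">abc has 3 characters and equals abc</code>.
</p>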
221 | <h5> | |
222 | <a name="boost_regex.captures.h1"></a> | |
223 | <span class="phrase"><a name="boost_regex.captures.unmatched_sub_expressions"></a></span><a class="link" href="captures.html#boost_regex.captures.unmatched_sub_expressions">Unmatched | |
224 | Sub-Expressions</a> | |
225 | </h5> | |
226 | <p> | |
227 | When a regular expression match is found, not every marked sub-expression | |
228 | need have participated in the match. For example, the expression: | |
229 | </p> | |
230 | <pre class="programlisting">(abc)|(def)</pre> | |
231 | <p> | |
232 | can match either $1 or $2, but never both at the same time. In Boost.Regex | |
233 | you can determine which sub-expressions matched by accessing the <code class="computeroutput"><span class="identifier">sub_match</span><span class="special">::</span><span class="identifier">matched</span></code> data member. | |
234 | </p> | |
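<p>
As a minimal sketch of that check, the function below reports which alternative of <code class="computeroutput">(abc)|(def)</code> took part in a match by testing the <code class="computeroutput">matched</code> member of each sub-match. It assumes the standard library's <code class="computeroutput">std::regex</code> as a stand-in for the Boost types, and <code class="computeroutput">which_alternative</code> is a hypothetical name.
</p>

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns 1 if $1 participated in the match, 2 if $2
// did, and 0 if `text` does not match at all.
// Assumption: std::regex stands in for boost::regex here.
int which_alternative(const std::string& text)
{
    static const std::regex e("(abc)|(def)");
    std::smatch m;
    if (!std::regex_match(text, m, e))
        return 0;
    if (m[1].matched)   // true only when (abc) took part in the match
        return 1;
    if (m[2].matched)   // true only when (def) took part in the match
        return 2;
    return 0;
}
```

<p>
Note that an unmatched sub-expression still has an entry in the results; it simply has <code class="computeroutput">matched</code> set to false and converts to an empty string.
</p>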
235 | <h5> | |
236 | <a name="boost_regex.captures.h2"></a> | |
237 | <span class="phrase"><a name="boost_regex.captures.repeated_captures"></a></span><a class="link" href="captures.html#boost_regex.captures.repeated_captures">Repeated | |
238 | Captures</a> | |
239 | </h5> | |
240 | <p> | |
241 | When a marked sub-expression is repeated, the sub-expression gets "captured" | |
242 | multiple times. Normally, however, only the final capture is available. For example, | |
243 | if | |
244 | </p> | |
245 | <pre class="programlisting">(?:(\w+)\W*)+</pre> | |
246 | <p> | |
247 | is matched against | |
248 | </p> | |
249 | <pre class="programlisting">one fine day</pre> | |
250 | <p> | |
251 | then $1 will contain the string "day", and all the previous captures | |
252 | will have been forgotten. | |
253 | </p> | |
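<p>
The default behaviour can be observed with a sketch such as the following. It assumes the standard library's <code class="computeroutput">std::regex</code> in place of the Boost types, and it writes the separator as <code class="computeroutput">\W*</code> so that the last word, which has no trailing separator, can take part in the match; <code class="computeroutput">last_capture</code> is a hypothetical name.
</p>

```cpp
#include <regex>
#include <string>

// Hypothetical helper: matches the whole of `text` as a sequence of words and
// returns whatever $1 holds afterwards. Because the marked sub-expression sits
// inside a repeat, only its final capture is retained.
// Assumption: std::regex stands in for boost::regex here.
std::string last_capture(const std::string& text)
{
    static const std::regex e(R"((?:(\w+)\W*)+)");
    std::smatch m;
    if (!std::regex_match(text, m, e))
        return std::string();
    return m[1].str();  // earlier captures ("one", "fine") are no longer available
}
```

<p>
Calling <code class="computeroutput">last_capture("one fine day")</code> returns "day"; the earlier captures cannot be recovered from the match results in this mode.
</p>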
254 | <p> | |
255 | However, Boost.Regex has an experimental feature that allows all the capture | |
256 | information to be retained - this is accessed either via the <code class="computeroutput"><span class="identifier">match_results</span><span class="special">::</span><span class="identifier">captures</span></code> member function or the <code class="computeroutput"><span class="identifier">sub_match</span><span class="special">::</span><span class="identifier">captures</span></code> member function. These functions | |
257 | return a container that contains a sequence of all the captures obtained during | |
258 | the regular expression matching. The following example program shows how this | |
259 | information may be used: | |
260 | </p> | |
261 | <pre class="programlisting"><span class="preprocessor">#include</span> <span class="special"><</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">></span> | |
262 | <span class="preprocessor">#include</span> <span class="special"><</span><span class="identifier">iostream</span><span class="special">></span> | |
263 | ||
264 | <span class="keyword">void</span> <span class="identifier">print_captures</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">&</span> <span class="identifier">regx</span><span class="special">,</span> <span class="keyword">const</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">&</span> <span class="identifier">text</span><span class="special">)</span> | |
265 | <span class="special">{</span> | |
266 | <span class="identifier">boost</span><span class="special">::</span><span class="identifier">regex</span> <span class="identifier">e</span><span class="special">(</span><span class="identifier">regx</span><span class="special">);</span> | |
267 | <span class="identifier">boost</span><span class="special">::</span><span class="identifier">smatch</span> <span class="identifier">what</span><span class="special">;</span> | |
268 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special"><<</span> <span class="string">"Expression: \""</span> <span class="special"><<</span> <span class="identifier">regx</span> <span class="special"><<</span> <span class="string">"\"\n"</span><span class="special">;</span> | |
269 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special"><<</span> <span class="string">"Text: \""</span> <span class="special"><<</span> <span class="identifier">text</span> <span class="special"><<</span> <span class="string">"\"\n"</span><span class="special">;</span> | |
270 | <span class="keyword">if</span><span class="special">(</span><span class="identifier">boost</span><span class="special">::</span><span class="identifier">regex_match</span><span class="special">(</span><span class="identifier">text</span><span class="special">,</span> <span class="identifier">what</span><span class="special">,</span> <span class="identifier">e</span><span class="special">,</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">match_extra</span><span class="special">))</span> | |
271 | <span class="special">{</span> | |
272 | <span class="keyword">unsigned</span> <span class="identifier">i</span><span class="special">,</span> <span class="identifier">j</span><span class="special">;</span> | |
273 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special"><<</span> <span class="string">"** Match found **\n Sub-Expressions:\n"</span><span class="special">;</span> | |
274 | <span class="keyword">for</span><span class="special">(</span><span class="identifier">i</span> <span class="special">=</span> <span class="number">0</span><span class="special">;</span> <span class="identifier">i</span> <span class="special"><</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">size</span><span class="special">();</span> <span class="special">++</span><span class="identifier">i</span><span class="special">)</span> | |
275 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special"><<</span> <span class="string">" $"</span> <span class="special"><<</span> <span class="identifier">i</span> <span class="special"><<</span> <span class="string">" = \""</span> <span class="special"><<</span> <span class="identifier">what</span><span class="special">[</span><span class="identifier">i</span><span class="special">]</span> <span class="special"><<</span> <span class="string">"\"\n"</span><span class="special">;</span> | |
276 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special"><<</span> <span class="string">" Captures:\n"</span><span class="special">;</span> | |
277 | <span class="keyword">for</span><span class="special">(</span><span class="identifier">i</span> <span class="special">=</span> <span class="number">0</span><span class="special">;</span> <span class="identifier">i</span> <span class="special"><</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">size</span><span class="special">();</span> <span class="special">++</span><span class="identifier">i</span><span class="special">)</span> | |
278 | <span class="special">{</span> | |
279 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special"><<</span> <span class="string">" $"</span> <span class="special"><<</span> <span class="identifier">i</span> <span class="special"><<</span> <span class="string">" = {"</span><span class="special">;</span> | |
280 | <span class="keyword">for</span><span class="special">(</span><span class="identifier">j</span> <span class="special">=</span> <span class="number">0</span><span class="special">;</span> <span class="identifier">j</span> <span class="special"><</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">captures</span><span class="special">(</span><span class="identifier">i</span><span class="special">).</span><span class="identifier">size</span><span class="special">();</span> <span class="special">++</span><span class="identifier">j</span><span class="special">)</span> | |
281 | <span class="special">{</span> | |
282 | <span class="keyword">if</span><span class="special">(</span><span class="identifier">j</span><span class="special">)</span> | |
283 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special"><<</span> <span class="string">", "</span><span class="special">;</span> | |
284 | <span class="keyword">else</span> | |
285 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special"><<</span> <span class="string">" "</span><span class="special">;</span> | |
286 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special"><<</span> <span class="string">"\""</span> <span class="special"><<</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">captures</span><span class="special">(</span><span class="identifier">i</span><span class="special">)[</span><span class="identifier">j</span><span class="special">]</span> <span class="special"><<</span> <span class="string">"\""</span><span class="special">;</span> | |
287 | <span class="special">}</span> | |
288 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special"><<</span> <span class="string">" }\n"</span><span class="special">;</span> | |
289 | <span class="special">}</span> | |
290 | <span class="special">}</span> | |
291 | <span class="keyword">else</span> | |
292 | <span class="special">{</span> | |
293 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special"><<</span> <span class="string">"** No Match found **\n"</span><span class="special">;</span> | |
294 | <span class="special">}</span> | |
295 | <span class="special">}</span> | |
296 | ||
297 | <span class="keyword">int</span> <span class="identifier">main</span><span class="special">(</span><span class="keyword">int</span> <span class="special">,</span> <span class="keyword">char</span><span class="special">*</span> <span class="special">[])</span> | |
298 | <span class="special">{</span> | |
299 | <span class="identifier">print_captures</span><span class="special">(</span><span class="string">"(([[:lower:]]+)|([[:upper:]]+))+"</span><span class="special">,</span> <span class="string">"aBBcccDDDDDeeeeeeee"</span><span class="special">);</span> | |
300 | <span class="identifier">print_captures</span><span class="special">(</span><span class="string">"(.*)bar|(.*)bah"</span><span class="special">,</span> <span class="string">"abcbar"</span><span class="special">);</span> | |
301 | <span class="identifier">print_captures</span><span class="special">(</span><span class="string">"(.*)bar|(.*)bah"</span><span class="special">,</span> <span class="string">"abcbah"</span><span class="special">);</span> | |
302 | <span class="identifier">print_captures</span><span class="special">(</span><span class="string">"^(?:(\\w+)|(?>\\W+))*$"</span><span class="special">,</span> | |
303 | <span class="string">"now is the time for all good men to come to the aid of the party"</span><span class="special">);</span> | |
304 | <span class="keyword">return</span> <span class="number">0</span><span class="special">;</span> | |
305 | <span class="special">}</span> | |
306 | </pre> | |
307 | <p> | |
308 | Which produces the following output: | |
309 | </p> | |
310 | <pre class="programlisting">Expression: "(([[:lower:]]+)|([[:upper:]]+))+" | |
311 | Text: "aBBcccDDDDDeeeeeeee" | |
312 | ** Match found ** | |
313 | Sub-Expressions: | |
314 | $0 = "aBBcccDDDDDeeeeeeee" | |
315 | $1 = "eeeeeeee" | |
316 | $2 = "eeeeeeee" | |
317 | $3 = "DDDDD" | |
318 | Captures: | |
319 | $0 = { "aBBcccDDDDDeeeeeeee" } | |
320 | $1 = { "a", "BB", "ccc", "DDDDD", "eeeeeeee" } | |
321 | $2 = { "a", "ccc", "eeeeeeee" } | |
322 | $3 = { "BB", "DDDDD" } | |
323 | Expression: "(.*)bar|(.*)bah" | |
324 | Text: "abcbar" | |
325 | ** Match found ** | |
326 | Sub-Expressions: | |
327 | $0 = "abcbar" | |
328 | $1 = "abc" | |
329 | $2 = "" | |
330 | Captures: | |
331 | $0 = { "abcbar" } | |
332 | $1 = { "abc" } | |
333 | $2 = { } | |
334 | Expression: "(.*)bar|(.*)bah" | |
335 | Text: "abcbah" | |
336 | ** Match found ** | |
337 | Sub-Expressions: | |
338 | $0 = "abcbah" | |
339 | $1 = "" | |
340 | $2 = "abc" | |
341 | Captures: | |
342 | $0 = { "abcbah" } | |
343 | $1 = { } | |
344 | $2 = { "abc" } | |
345 | Expression: "^(?:(\w+)|(?>\W+))*$" | |
346 | Text: "now is the time for all good men to come to the aid of the party" | |
347 | ** Match found ** | |
348 | Sub-Expressions: | |
349 | $0 = "now is the time for all good men to come to the aid of the party" | |
350 | $1 = "party" | |
351 | Captures: | |
352 | $0 = { "now is the time for all good men to come to the aid of the party" } | |
353 | $1 = { "now", "is", "the", "time", "for", "all", "good", "men", "to", | |
354 | "come", "to", "the", "aid", "of", "the", "party" } | |
355 | </pre> | |
356 | <p> | |
357 | Unfortunately, enabling this feature has an impact on performance (even if you | |
358 | don't use it), and a much bigger impact if you do use it. Therefore, to use | |
359 | this feature you need to: | |
360 | </p> | |
361 | <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "> | |
362 | <li class="listitem"> | |
363 | Define BOOST_REGEX_MATCH_EXTRA for all translation units including the | |
364 | library source (the best way to do this is to uncomment this define in | |
365 | boost/regex/user.hpp and then rebuild everything). | |
366 | </li> | |
367 | <li class="listitem"> | |
368 | Pass the match_extra flag to the particular algorithms where you actually | |
369 | need the captures information (regex_search, regex_match, or regex_iterator). | |
370 | </li> | |
371 | </ul></div> | |
372 | </div> | |
373 | <table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr> | |
374 | <td align="left"></td> | |
375 | <td align="right"><div class="copyright-footer">Copyright © 1998-2013 John Maddock<p> | |
376 | Distributed under the Boost Software License, Version 1.0. (See accompanying | |
377 | file LICENSE_1_0.txt or copy at <a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>) | |
378 | </p> | |
379 | </div></td> | |
380 | </tr></table> | |
381 | <hr> | |
382 | <div class="spirit-nav"> | |
383 | <a accesskey="p" href="unicode.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="partial_matches.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a> | |
384 | </div> | |
385 | </body> | |
386 | </html> |