[/
 / Copyright (c) 2008 Eric Niebler
 /
 / Distributed under the Boost Software License, Version 1.0. (See accompanying
 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 /]

[section Static Regexes]

[h2 Overview]

The feature that really sets xpressive apart from other C/C++ regular
expression libraries is the ability to author a regular expression using C++
expressions. xpressive achieves this through operator overloading, using a
technique called ['expression templates] to embed a mini-language dedicated
to pattern matching within C++. These "static regexes" have many advantages
over their string-based brethren. In particular, static regexes:

* are syntax-checked at compile-time; they will never fail at run-time due to
  a syntax error.
* can naturally refer to other C++ data and code, including other regexes,
  making it simple to build grammars out of regular expressions and bind
  user-defined actions that execute when parts of your regex match.
* are statically bound for better inlining and optimization. Static regexes
  require no state tables, virtual functions, byte-code or calls through
  function pointers that cannot be resolved at compile time.
* are not limited to searching for patterns in strings. You can declare a
  static regex that finds patterns in an array of integers, for instance.

Since we compose static regexes using C++ expressions, we are constrained by
the rules for legal C++ expressions. Unfortunately, that means that
"classic" regular expression syntax cannot always be mapped cleanly into
C++. Rather, we map the regex ['constructs], picking new syntax that is
legal C++.

[h2 Construction and Assignment]

You create a static regex by assigning one to an object of type _basic_regex_.
For instance, the following defines a regex that can be used to find patterns
in objects of type `std::string`:

    sregex re = '$' >> +_d >> '.' >> _d >> _d;

Assignment works the same way: you can assign a static regex expression to an
existing _basic_regex_ object, replacing the pattern it held.

[h2 Character and String Literals]

In static regexes, character and string literals match themselves. For
instance, in the regex above, `'$'` and `'.'` match the characters `'$'` and
`'.'` respectively. Don't be confused by the fact that [^$] and [^.] are
meta-characters in Perl. In xpressive, literals always represent themselves.

When using literals in static regexes, you must take care that at least one
operand is not a literal. For instance, the following are ['not] valid
regexes:

    sregex re1 = 'a' >> 'b';         // ERROR!
    sregex re2 = +'a';               // ERROR!

The two operands to the binary `>>` operator are both literals, and the
operand of the unary `+` operator is also a literal, so these statements
will call the native C++ binary right-shift and unary plus operators,
respectively. That's not what we want. To get operator overloading to kick
in, at least one operand must be a user-defined type. We can use xpressive's
`as_xpr()` helper function to "taint" an expression with regex-ness, forcing
operator overloading to find the correct operators. The two regexes above
should be written as:

    sregex re1 = as_xpr('a') >> 'b'; // OK
    sregex re2 = +as_xpr('a');       // OK

[h2 Sequencing and Alternation]

As you've probably already noticed, sub-expressions in static regexes must
be separated by the sequencing operator, `>>`. You can read this operator as
"followed by".

    // Match an 'a' followed by a digit
    sregex re = 'a' >> _d;

Alternation works just as it does in Perl with the `|` operator. You can
read this operator as "or". For example:

    // match a digit character or a word character one or more times
    sregex re = +( _d | _w );

[h2 Grouping and Captures]

In Perl, parentheses `()` have special meaning. They group, but as a
side-effect they also create back\-references like [^$1] and [^$2]. In C++,
parentheses only group \-\- there is no way to give them side\-effects. To
get the same effect, we use the special `s1`, `s2`, etc. tokens. Assigning
to one creates a back-reference. You can then use the back-reference later
in your expression, like using [^\1] and [^\2] in Perl. For example,
consider the following regex, which finds matching HTML tags:

    "<(\\w+)>.*?</\\1>"

In static xpressive, this would be:

    '<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'

Notice how you capture a back-reference by assigning to `s1`, and then you
use `s1` later in the pattern to find the matching end tag.

[tip [*Grouping without capturing a back-reference] \n\n In
xpressive, if you just want grouping without capturing a back-reference, you
can just use `()` without `s1`. That is the equivalent of Perl's [^(?:)]
non-capturing grouping construct.]

[h2 Case-Insensitivity and Internationalization]

Perl lets you make part of your regular expression case-insensitive by using
the [^(?i:)] pattern modifier. xpressive also has a case-insensitivity
pattern modifier, called `icase`. You can use it as follows:

    sregex re = "this" >> icase( "that" );

In this regular expression, `"this"` will be matched exactly, but `"that"`
will be matched irrespective of case.

Case-insensitive regular expressions raise the issue of
internationalization: how should case-insensitive character comparisons be
evaluated? Also, many character classes are locale-specific. Which
characters are matched by `digit` and which are matched by `alpha`? The
answer depends on the `std::locale` object the regular expression object is
using. By default, all regular expression objects use the global locale. You
can override the default by using the `imbue()` pattern modifier, as
follows:

    std::locale my_locale = /* initialize a std::locale object */;
    sregex re = imbue( my_locale )( +alpha >> +digit );

This regular expression will evaluate `alpha` and `digit` according to
`my_locale`. See the section on [link boost_xpressive.user_s_guide.localization_and_regex_traits
Localization and Regex Traits] for more information about how to customize
the behavior of your regexes.

[h2 Static xpressive Syntax Cheat Sheet]

The table below lists the familiar regex constructs and their equivalents in
static xpressive.

[def _s1_       [globalref boost::xpressive::s1 s1]]
[def _bos_      [globalref boost::xpressive::bos bos]]
[def _eos_      [globalref boost::xpressive::eos eos]]
[def _b_        [globalref boost::xpressive::_b _b]]
[def _n_        [globalref boost::xpressive::_n _n]]
[def _ln_       [globalref boost::xpressive::_ln _ln]]
[def _d_        [globalref boost::xpressive::_d _d]]
[def _w_        [globalref boost::xpressive::_w _w]]
[def _s_        [globalref boost::xpressive::_s _s]]
[def _alnum_    [globalref boost::xpressive::alnum alnum]]
[def _alpha_    [globalref boost::xpressive::alpha alpha]]
[def _blank_    [globalref boost::xpressive::blank blank]]
[def _cntrl_    [globalref boost::xpressive::cntrl cntrl]]
[def _digit_    [globalref boost::xpressive::digit digit]]
[def _graph_    [globalref boost::xpressive::graph graph]]
[def _lower_    [globalref boost::xpressive::lower lower]]
[def _print_    [globalref boost::xpressive::print print]]
[def _punct_    [globalref boost::xpressive::punct punct]]
[def _space_    [globalref boost::xpressive::space space]]
[def _upper_    [globalref boost::xpressive::upper upper]]
[def _xdigit_   [globalref boost::xpressive::xdigit xdigit]]
[def _set_      [globalref boost::xpressive::set set]]
[def _repeat_   [funcref boost::xpressive::repeat repeat]]
[def _range_    [funcref boost::xpressive::range range]]
[def _icase_    [funcref boost::xpressive::icase icase]]
[def _before_   [funcref boost::xpressive::before before]]
[def _after_    [funcref boost::xpressive::after after]]
[def _keep_     [funcref boost::xpressive::keep keep]]

[table Perl syntax vs. Static xpressive syntax
    [[Perl] [Static xpressive] [Meaning]]
    [[[^.]] [[globalref boost::xpressive::_ `_`]] [any character (assuming Perl's /s modifier).]]
    [[[^ab]] [`a >> b`] [sequencing of [^a] and [^b] sub-expressions.]]
    [[[^a|b]] [`a | b`] [alternation of [^a] and [^b] sub-expressions.]]
    [[[^(a)]] [`(_s1_= a)`] [group and capture a back-reference.]]
    [[[^(?:a)]] [`(a)`] [group and do not capture a back-reference.]]
    [[[^\1]] [`_s1_`] [a previously captured back-reference.]]
    [[[^a*]] [`*a`] [zero or more times, greedy.]]
    [[[^a+]] [`+a`] [one or more times, greedy.]]
    [[[^a?]] [`!a`] [zero or one time, greedy.]]
    [[[^a{n,m}]] [`_repeat_<n,m>(a)`] [between [^n] and [^m] times, greedy.]]
    [[[^a*?]] [`-*a`] [zero or more times, non-greedy.]]
    [[[^a+?]] [`-+a`] [one or more times, non-greedy.]]
    [[[^a??]] [`-!a`] [zero or one time, non-greedy.]]
    [[[^a{n,m}?]] [`-_repeat_<n,m>(a)`] [between [^n] and [^m] times, non-greedy.]]
    [[[^^]] [`_bos_`] [beginning of sequence assertion.]]
    [[[^$]] [`_eos_`] [end of sequence assertion.]]
    [[[^\b]] [`_b_`] [word boundary assertion.]]
    [[[^\B]] [`~_b_`] [not word boundary assertion.]]
    [[[^\\n]] [`_n_`] [literal newline.]]
    [[[^.]] [`~_n_`] [any character except a literal newline (without Perl's /s modifier).]]
    [[[^\\r?\\n|\\r]] [`_ln_`] [logical newline.]]
    [[[^\[^\\r\\n\]]] [`~_ln_`] [any single character not a logical newline.]]
    [[[^\w]] [`_w_`] [a word character, equivalent to set\[alnum | '_'\].]]
    [[[^\W]] [`~_w_`] [not a word character, equivalent to ~set\[alnum | '_'\].]]
    [[[^\d]] [`_d_`] [a digit character.]]
    [[[^\D]] [`~_d_`] [not a digit character.]]
    [[[^\s]] [`_s_`] [a space character.]]
    [[[^\S]] [`~_s_`] [not a space character.]]
    [[[^\[:alnum:\]]] [`_alnum_`] [an alpha-numeric character.]]
    [[[^\[:alpha:\]]] [`_alpha_`] [an alphabetic character.]]
    [[[^\[:blank:\]]] [`_blank_`] [a horizontal white-space character.]]
    [[[^\[:cntrl:\]]] [`_cntrl_`] [a control character.]]
    [[[^\[:digit:\]]] [`_digit_`] [a digit character.]]
    [[[^\[:graph:\]]] [`_graph_`] [a graphable character.]]
    [[[^\[:lower:\]]] [`_lower_`] [a lower-case character.]]
    [[[^\[:print:\]]] [`_print_`] [a printing character.]]
    [[[^\[:punct:\]]] [`_punct_`] [a punctuation character.]]
    [[[^\[:space:\]]] [`_space_`] [a white-space character.]]
    [[[^\[:upper:\]]] [`_upper_`] [an upper-case character.]]
    [[[^\[:xdigit:\]]] [`_xdigit_`] [a hexadecimal digit character.]]
    [[[^\[0-9\]]] [`_range_('0','9')`] [characters in range `'0'` through `'9'`.]]
    [[[^\[abc\]]] [`as_xpr('a') | 'b' |'c'`] [characters `'a'`, `'b'`, or `'c'`.]]
    [[[^\[abc\]]] [`(_set_= 'a','b','c')`] [['same as above]]]
    [[[^\[0-9abc\]]] [`_set_[ _range_('0','9') | 'a' | 'b' | 'c' ]`] [characters `'a'`, `'b'`, `'c'` or in range `'0'` through `'9'`.]]
    [[[^\[0-9abc\]]] [`_set_[ _range_('0','9') | (_set_= 'a','b','c') ]`] [['same as above]]]
    [[[^\[^abc\]]] [`~(_set_= 'a','b','c')`] [not characters `'a'`, `'b'`, or `'c'`.]]
    [[[^(?i:['stuff])]] [`_icase_(`[^['stuff]]`)`] [match ['stuff] disregarding case.]]
    [[[^(?>['stuff])]] [`_keep_(`[^['stuff]]`)`] [independent sub-expression, match ['stuff] and turn off backtracking.]]
    [[[^(?=['stuff])]] [`_before_(`[^['stuff]]`)`] [positive look-ahead assertion, match if before ['stuff] but don't include ['stuff] in the match.]]
    [[[^(?!['stuff])]] [`~_before_(`[^['stuff]]`)`] [negative look-ahead assertion, match if not before ['stuff].]]
    [[[^(?<=['stuff])]] [`_after_(`[^['stuff]]`)`] [positive look-behind assertion, match if after ['stuff] but don't include ['stuff] in the match. (['stuff] must be constant-width.)]]
    [[[^(?<!['stuff])]] [`~_after_(`[^['stuff]]`)`] [negative look-behind assertion, match if not after ['stuff]. (['stuff] must be constant-width.)]]
    [[[^(?P<['name]>['stuff])]] [`_mark_tag_ `[^['name]]`(`['n]`);`\n ...\n `(`[^['name]]`= `[^['stuff]]`)`] [Create a named capture.]]
    [[[^(?P=['name])]] [`_mark_tag_ `[^['name]]`(`['n]`);`\n ...\n [^['name]]] [Refer back to a previously created named capture.]]
]
\n

[endsect]