[/
  Copyright 2006-2007 John Maddock.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]

[section Introduction and Overview]

Regular expressions are a form of pattern matching that is often used in
text processing; many users will be familiar with the Unix utilities grep, sed
and awk, and the programming language Perl, each of which makes extensive use
of regular expressions. Traditionally, C++ users have been limited to the
POSIX C APIs for manipulating regular expressions, and while Boost.Regex does
provide these APIs, they do not represent the best way to use the library.
For example, Boost.Regex can cope with wide character strings, and with search
and replace operations (in a manner analogous to either sed or Perl), neither
of which traditional C libraries can do.

The class [basic_regex] is the key class in this library; it represents a
"machine readable" regular expression, and is very closely modeled on
`std::basic_string`: think of it as a string plus the actual state machine
required by the regular expression algorithms. As with `std::basic_string`, there
are two typedefs that are almost always the means by which this class is referenced:

   namespace boost{

   template <class charT,
             class traits = regex_traits<charT> >
   class basic_regex;

   typedef basic_regex<char>      regex;
   typedef basic_regex<wchar_t>   wregex;

   }

To see how this library can be used, imagine that we are writing a credit
card processing application. Credit card numbers generally come as a string
of 16 digits, split into groups of 4 digits that are separated by either a
space or a hyphen. Before storing a credit card number in a database
(not necessarily something your customers will appreciate!), we may want to
verify that the number is in the correct format. To match any digit we could
use the regular expression \[0-9\]; however, ranges of characters like this are
actually locale dependent. Instead we should use the POSIX standard
form \[\[:digit:\]\], or the Boost.Regex and Perl shorthand for this, \\d (note
that many older libraries tended to be hard-coded to the C locale, so
this was not an issue for them). That leaves us with the
following regular expression to validate credit card number formats:

[pre (\d{4}\[- \]){3}\d{4}]

Here the parentheses act to group (and mark for future reference)
sub-expressions, and the {4} means "repeat exactly 4 times". This is an
example of the extended regular expression syntax used by Perl, awk and egrep.
Boost.Regex also supports the older "basic" syntax used by sed and grep,
but this is generally less useful, unless you already have some basic regular
expressions that you need to reuse.

Now let's take that expression and place it in some C++ code to validate the
format of a credit card number:

   bool validate_card_format(const std::string& s)
   {
      static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
      return regex_match(s, e);
   }

Note how we had to add some extra escapes to the expression: the escape is
seen once by the C++ compiler before it is seen by the regular expression
engine, so escapes in regular expressions have to be doubled up when
embedding them in C/C++ code. Also note that
all the examples assume that your compiler supports argument-dependent
lookup; if yours doesn't (for example VC6), then you will have to add some
`boost::` prefixes to some of the function calls in the examples.
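
As a self-contained illustration of the escaping point, the sketch below writes the same check twice: once with doubled backslashes and once with a C++11 raw string literal, which needs no doubling. It is written against `std::regex` rather than Boost.Regex purely so that it builds without Boost; the function names are hypothetical, and it assumes that the two libraries treat this particular pattern the same way.

```cpp
#include <regex>
#include <string>

// Same check as validate_card_format above, but using std::regex
// (an assumption made so the sketch builds without Boost).
bool validate_card_format_std(const std::string& s)
{
   // Doubled backslashes: the compiler consumes one level of escaping.
   static const std::regex e("(\\d{4}[- ]){3}\\d{4}");
   return std::regex_match(s, e);
}

// The same pattern as a raw string literal: the regular expression
// appears exactly as the regex engine will see it.
bool validate_card_format_raw(const std::string& s)
{
   static const std::regex e(R"((\d{4}[- ]){3}\d{4})");
   return std::regex_match(s, e);
}
```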

Those of you who are familiar with credit card processing will have realized
that while the format used above is suitable for human readable card numbers,
it does not represent the format required by online credit card systems; these
require the number as a string of 16 (or possibly 15) digits, without any
intervening spaces. What we need is a means to convert easily between the two
formats, and this is where search and replace comes in. Those who are familiar
with the utilities sed and Perl will already be ahead here; we need two
strings: one a regular expression, the other a "format string" that provides
a description of the text to replace the match with. In Boost.Regex this
search and replace operation is performed with the algorithm [regex_replace].
For our credit card example we can write two functions like this to
provide the format conversions:

   // match any format with the regular expression:
   const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
   const std::string machine_format("\\1\\2\\3\\4");
   const std::string human_format("\\1-\\2-\\3-\\4");

   std::string machine_readable_card_number(const std::string& s)
   {
      return regex_replace(s, e, machine_format, boost::match_default | boost::format_sed);
   }

   std::string human_readable_card_number(const std::string& s)
   {
      return regex_replace(s, e, human_format, boost::match_default | boost::format_sed);
   }

Here we've used marked sub-expressions in the regular expression to split out
the four parts of the card number as separate fields; the format string then
uses the sed-like syntax to replace the matched text with the reformatted version.
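
The conversion can be tried out without Boost using the rough `std::regex` equivalent sketched below. Two things here are assumptions rather than part of the example above: the function names are hypothetical, and because the default ECMAScript grammar of `std::regex` has no `\A` or `\z`, the sketch anchors the expression with `^` and `$` instead.

```cpp
#include <regex>
#include <string>

// Hypothetical std::regex counterpart of the expression used above.
// ^ and $ stand in for \A and \z, which ECMAScript syntax lacks.
static const std::regex card_re(
   R"(^(\d{3,4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})$)");

std::string machine_readable(const std::string& s)
{
   // format_sed makes \1..\4 refer to the marked sub-expressions.
   return std::regex_replace(s, card_re, std::string("\\1\\2\\3\\4"),
                             std::regex_constants::format_sed);
}

std::string human_readable(const std::string& s)
{
   return std::regex_replace(s, card_re, std::string("\\1-\\2-\\3-\\4"),
                             std::regex_constants::format_sed);
}
```

Input that does not match the expression at all is copied through unchanged, so a caller that needs to reject bad input should validate it separately.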

In the examples above, we haven't directly manipulated the results of
a regular expression match; in general, however, the result of a match contains
a number of sub-expression matches in addition to the overall match. When the
library needs to report a regular expression match it does so using an instance
of the class [match_results]. As before, there are typedefs of this class for
the most common cases:

   namespace boost{

   typedef match_results<const char*>                  cmatch;
   typedef match_results<const wchar_t*>               wcmatch;
   typedef match_results<std::string::const_iterator>  smatch;
   typedef match_results<std::wstring::const_iterator> wsmatch;

   }

The algorithms [regex_search] and [regex_match] make use of [match_results]
to report what matched; the difference between these algorithms is that
[regex_match] will only find matches that consume /all/ of the input text,
whereas [regex_search] will search for a match anywhere within the text being matched.
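
The difference between the two algorithms, and the way sub-expression matches are read back, can be sketched as follows. This is an illustration only: it uses `std::smatch` and the `std::` algorithms in place of their Boost counterparts so that it builds without Boost, and both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Returns the first 4-digit group of the first card-like number found
// anywhere in text, or an empty string if there is none.
std::string first_group_of_card(const std::string& text)
{
   static const std::regex e(R"((\d{4})[- ](\d{4})[- ](\d{4})[- ](\d{4}))");
   std::smatch what;                  // match_results over std::string iterators
   if (std::regex_search(text, what, e))
      return what[1].str();           // what[0] would be the whole match
   return std::string();
}

// True only if the whole of text is a card-like number.
bool is_exactly_a_card(const std::string& text)
{
   static const std::regex e(R"((\d{4})[- ](\d{4})[- ](\d{4})[- ](\d{4}))");
   std::smatch what;
   return std::regex_match(text, what, e);
}
```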

Note that these algorithms are not restricted to searching regular C-strings:
any bidirectional iterator type can be searched, allowing for the
possibility of seamlessly searching almost any kind of data.
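
As a small illustration of searching something other than a string, the sketch below runs a search over a `std::list<char>`, whose iterators are bidirectional but not random access. It assumes the `std::regex` algorithms behave like the Boost ones in this respect, and the function name is hypothetical.

```cpp
#include <list>
#include <regex>
#include <string>

// Searches a std::list<char> and returns the first run of digits,
// or an empty string if there is none.
std::string first_number(const std::list<char>& data)
{
   typedef std::list<char>::const_iterator iter;
   static const std::regex e("[[:digit:]]+");
   std::match_results<iter> what;     // match_results over list iterators
   if (std::regex_search(data.begin(), data.end(), what, e))
      return what[0].str();
   return std::string();
}
```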

For search and replace operations, in addition to the algorithm [regex_replace]
that we have already seen, the [match_results] class has a `format` member that
takes the result of a match and a format string, and produces a new string
by merging the two.
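
A minimal sketch of the `format` member, assuming the `std::match_results` version and its default format syntax, in which `$4` stands for the fourth marked sub-expression. The function name and the masking scheme are hypothetical and only there to give the format string something to do.

```cpp
#include <regex>
#include <string>

// Finds the first card-like number in text and returns it with all but
// the last four digits masked; returns an empty string when nothing matches.
std::string masked_card(const std::string& text)
{
   static const std::regex e(R"((\d{4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4}))");
   std::smatch what;
   if (!std::regex_search(text, what, e))
      return std::string();
   // Merge the match with a format string: $4 is the fourth sub-expression.
   return what.format("****-****-****-$4");
}
```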

For iterating through all occurrences of an expression within a text,
there are two iterator types: [regex_iterator] will enumerate over the
[match_results] objects found, while [regex_token_iterator] will enumerate
a series of strings (similar to Perl-style split operations).
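
The two iterator types can be compared side by side in the sketch below, which uses the `std::` typedefs `sregex_iterator` and `sregex_token_iterator` on the assumption that they mirror the Boost ones here. Passing `-1` as the sub-match index asks the token iterator for the text between matches, which gives the split-like behaviour. Both function names are hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// regex_iterator: visit every match and collect the matched text.
std::vector<std::string> all_numbers(const std::string& text)
{
   static const std::regex e("[[:digit:]]+");
   std::vector<std::string> out;
   for (std::sregex_iterator it(text.begin(), text.end(), e), end; it != end; ++it)
      out.push_back(it->str());       // *it is a match_results object
   return out;
}

// regex_token_iterator with index -1: yield the pieces between matches,
// much like a split on the separator expression.
std::vector<std::string> split_on_commas(const std::string& text)
{
   static const std::regex sep(",[[:space:]]*");
   std::vector<std::string> out;
   for (std::sregex_token_iterator it(text.begin(), text.end(), sep, -1), end;
        it != end; ++it)
      out.push_back(it->str());       // *it is a single sub_match
   return out;
}
```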

For those who dislike templates, there is a high-level wrapper class
[RegEx] that is an encapsulation of the lower-level template code. It
provides a simplified interface for those who don't need the full
power of the library, and supports only narrow characters and the
"extended" regular expression syntax. This class is now deprecated as
it does not form part of the regular expressions C++ standard library proposal.

The POSIX API functions [regcomp], [regexec], [regfree] and [regerror]
are available in both narrow character and Unicode versions, and are
provided for those who need compatibility with these APIs.
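
For readers who have not met the POSIX interface, the sketch below shows the usual compile, execute, free sequence. It is written against the `<regex.h>` header supplied by a POSIX platform, not against the Boost implementation of these functions, so it assumes such a platform; the function name is hypothetical.

```cpp
#include <regex.h>     // POSIX header supplied by the platform, not by Boost
#include <stdexcept>
#include <string>

// Returns true if s contains a run of four digits, using the POSIX C API.
bool has_four_digits(const std::string& s)
{
   regex_t re;
   // REG_EXTENDED selects the "extended" syntax; REG_NOSUB says that
   // no sub-expression positions need to be reported.
   const int err = regcomp(&re, "[[:digit:]]{4}", REG_EXTENDED | REG_NOSUB);
   if (err != 0)
   {
      char msg[128];
      regerror(err, &re, msg, sizeof msg);   // turn the error code into text
      throw std::runtime_error(msg);
   }
   const int rc = regexec(&re, s.c_str(), 0, nullptr, 0);
   regfree(&re);                             // release the compiled expression
   return rc == 0;                           // 0 means a match was found
}
```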

Finally, note that the library now has
[link boost_regex.background_information.locale run-time localization support],
and recognizes the full POSIX regular expression syntax - including
advanced features like multi-character collating elements and equivalence
classes - as well as providing compatibility with other regular expression
libraries including GNU and BSD4 regex packages, PCRE and Perl 5.

[endsect]