Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | <html> |
2 | <head> | |
3 | <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII"> | |
4 | <title>Unicode Regular Expression Algorithms</title> | |
5 | <link rel="stylesheet" href="../../../../../../../../doc/src/boostbook.css" type="text/css"> | |
6 | <meta name="generator" content="DocBook XSL Stylesheets V1.77.1"> | |
7 | <link rel="home" href="../../../../index.html" title="Boost.Regex 5.1.2"> | |
8 | <link rel="up" href="../icu.html" title="Working With Unicode and ICU String Types"> | |
9 | <link rel="prev" href="unicode_types.html" title="Unicode regular expression types"> | |
10 | <link rel="next" href="unicode_iter.html" title="Unicode Aware Regex Iterators"> | |
11 | </head> | |
12 | <body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"> | |
13 | <table cellpadding="2" width="100%"><tr> | |
14 | <td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../../../../boost.png"></td> | |
15 | <td align="center"><a href="../../../../../../../../index.html">Home</a></td> | |
16 | <td align="center"><a href="../../../../../../../../libs/libraries.htm">Libraries</a></td> | |
17 | <td align="center"><a href="http://www.boost.org/users/people.html">People</a></td> | |
18 | <td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td> | |
19 | <td align="center"><a href="../../../../../../../../more/index.htm">More</a></td> | |
20 | </tr></table> | |
21 | <hr> | |
22 | <div class="spirit-nav"> | |
23 | <a accesskey="p" href="unicode_types.html"><img src="../../../../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../icu.html"><img src="../../../../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../../../../index.html"><img src="../../../../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="unicode_iter.html"><img src="../../../../../../../../doc/src/images/next.png" alt="Next"></a> | |
24 | </div> | |
25 | <div class="section"> | |
26 | <div class="titlepage"><div><div><h5 class="title"> | |
27 | <a name="boost_regex.ref.non_std_strings.icu.unicode_algo"></a><a class="link" href="unicode_algo.html" title="Unicode Regular Expression Algorithms">Unicode | |
28 | Regular Expression Algorithms</a> | |
29 | </h5></div></div></div> | |
30 | <p> | |
31 | The regular expression algorithms <a class="link" href="../../regex_match.html" title="regex_match"><code class="computeroutput"><span class="identifier">regex_match</span></code></a>, <a class="link" href="../../regex_search.html" title="regex_search"><code class="computeroutput"><span class="identifier">regex_search</span></code></a> and <a class="link" href="../../regex_replace.html" title="regex_replace"><code class="computeroutput"><span class="identifier">regex_replace</span></code></a> all expect that | |
32 | the character sequence upon which they operate is encoded in the same | |
33 | character encoding as the regular expression object with which they are | |
34 | used. For Unicode regular expressions that behavior is undesirable: while | |
35 | we may want to process the data in UTF-32 "chunks", the actual | |
36 | data is much more likely to be encoded as either UTF-8 or UTF-16. Therefore | |
37 | the header &lt;boost/regex/icu.hpp&gt; provides a series of thin wrappers | |
38 | around these algorithms, called <code class="computeroutput"><span class="identifier">u32regex_match</span></code>, | |
39 | <code class="computeroutput"><span class="identifier">u32regex_search</span></code>, and | |
40 | <code class="computeroutput"><span class="identifier">u32regex_replace</span></code>. These | |
41 | wrappers use iterator-adapters internally to make external UTF-8 or UTF-16 | |
42 | data look as though it's really a UTF-32 sequence that can then be passed | |
43 | on to the "real" algorithm. | |
44 | </p> | |
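<p>
To make the adapter idea concrete, here is a minimal standard C++ sketch of reading UTF-8 bytes as UTF-32 code points. It does not use or reproduce the Boost.Regex adapters: the names <code class="computeroutput">utf8_cursor</code> and <code class="computeroutput">as_utf32</code> are hypothetical, and the decoder assumes well-formed input apart from a few basic checks (it does not reject overlong forms or encoded surrogates).
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Boost.Regex): a read-only cursor that
// walks a UTF-8 byte sequence and yields one UTF-32 code point per step.
class utf8_cursor
{
public:
   explicit utf8_cursor(const std::string& bytes) : m_bytes(bytes), m_pos(0) {}

   bool at_end() const { return m_pos == m_bytes.size(); }

   // Decode the code point at the current position and advance past it.
   char32_t next()
   {
      assert(!at_end());
      const unsigned char lead = static_cast<unsigned char>(m_bytes[m_pos]);
      std::size_t extra = 0;   // number of trail bytes that follow the lead byte
      char32_t cp = 0;
      if(lead < 0x80)              { extra = 0; cp = lead; }
      else if((lead >> 5) == 0x06) { extra = 1; cp = lead & 0x1F; }
      else if((lead >> 4) == 0x0E) { extra = 2; cp = lead & 0x0F; }
      else if((lead >> 3) == 0x1E) { extra = 3; cp = lead & 0x07; }
      else throw std::runtime_error("Invalid UTF-8 lead byte");

      if(m_bytes.size() - m_pos <= extra)
         throw std::runtime_error("Truncated UTF-8 sequence");

      for(std::size_t i = 1; i <= extra; ++i)
      {
         const unsigned char trail = static_cast<unsigned char>(m_bytes[m_pos + i]);
         if((trail & 0xC0) != 0x80)
            throw std::runtime_error("Invalid UTF-8 trail byte");
         cp = (cp << 6) | (trail & 0x3F);
      }
      m_pos += extra + 1;
      return cp;
   }

private:
   const std::string& m_bytes;
   std::size_t m_pos;
};

// Collect every code point, in the form a UTF-32 based algorithm would see.
std::u32string as_utf32(const std::string& bytes)
{
   std::u32string out;
   for(utf8_cursor c(bytes); !c.at_end();)
      out.push_back(c.next());
   return out;
}
```

<p>
The sketch copies the decoded code points into a new string for clarity; an iterator adapter would instead decode on demand as the algorithm advances, so no UTF-32 copy of the input is needed.
</p>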
45 | <h5> | |
46 | <a name="boost_regex.ref.non_std_strings.icu.unicode_algo.h0"></a> | |
47 | <span class="phrase"><a name="boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_match"></a></span><a class="link" href="unicode_algo.html#boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_match">u32regex_match</a> | |
48 | </h5> | |
49 | <p> | |
50 | For each <a class="link" href="../../regex_match.html" title="regex_match"><code class="computeroutput"><span class="identifier">regex_match</span></code></a> | |
51 | algorithm defined by <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code>, | |
52 | <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">/</span><span class="identifier">icu</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code> defines an overloaded algorithm that | |
53 | takes the same arguments but is called <code class="computeroutput"><span class="identifier">u32regex_match</span></code>, | |
54 | and accepts UTF-8, UTF-16 or UTF-32 encoded data, as well as | |
55 | an ICU UnicodeString, as input. | |
56 | </p> | |
57 | <p> | |
58 | Example: match a password, encoded in a UTF-16 UnicodeString: | |
59 | </p> | |
60 | <pre class="programlisting"><span class="comment">//</span> | |
61 | <span class="comment">// Find out if *password* meets our password requirements,</span> | |
62 | <span class="comment">// as defined by the regular expression *requirements*.</span> | |
63 | <span class="comment">//</span> | |
64 | <span class="keyword">bool</span> <span class="identifier">is_valid_password</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">UnicodeString</span><span class="special">&</span> <span class="identifier">password</span><span class="special">,</span> <span class="keyword">const</span> <span class="identifier">UnicodeString</span><span class="special">&</span> <span class="identifier">requirements</span><span class="special">)</span> | |
65 | <span class="special">{</span> | |
66 | <span class="keyword">return</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex_match</span><span class="special">(</span><span class="identifier">password</span><span class="special">,</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">make_u32regex</span><span class="special">(</span><span class="identifier">requirements</span><span class="special">));</span> | |
67 | <span class="special">}</span> | |
68 | </pre> | |
69 | <p> | |
70 | Example: match a UTF-8 encoded filename: | |
71 | </p> | |
72 | <pre class="programlisting"><span class="comment">//</span> | |
73 | <span class="comment">// Extract filename part of a path from a UTF-8 encoded std::string and return the result</span> | |
74 | <span class="comment">// as another std::string:</span> | |
75 | <span class="comment">//</span> | |
76 | <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">get_filename</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">&</span> <span class="identifier">path</span><span class="special">)</span> | |
77 | <span class="special">{</span> | |
78 | <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex</span> <span class="identifier">r</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">make_u32regex</span><span class="special">(</span><span class="string">"(?:\\A|.*\\\\)([^\\\\]+)"</span><span class="special">);</span> | |
79 | <span class="identifier">boost</span><span class="special">::</span><span class="identifier">smatch</span> <span class="identifier">what</span><span class="special">;</span> | |
80 | <span class="keyword">if</span><span class="special">(</span><span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex_match</span><span class="special">(</span><span class="identifier">path</span><span class="special">,</span> <span class="identifier">what</span><span class="special">,</span> <span class="identifier">r</span><span class="special">))</span> | |
81 | <span class="special">{</span> | |
82 | <span class="comment">// extract $1 as a std::string:</span> | |
83 | <span class="keyword">return</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">str</span><span class="special">(</span><span class="number">1</span><span class="special">);</span> | |
84 | <span class="special">}</span> | |
85 | <span class="keyword">else</span> | |
86 | <span class="special">{</span> | |
87 | <span class="keyword">throw</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">runtime_error</span><span class="special">(</span><span class="string">"Invalid pathname"</span><span class="special">);</span> | |
88 | <span class="special">}</span> | |
89 | <span class="special">}</span> | |
90 | </pre> | |
91 | <h5> | |
92 | <a name="boost_regex.ref.non_std_strings.icu.unicode_algo.h1"></a> | |
93 | <span class="phrase"><a name="boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_search"></a></span><a class="link" href="unicode_algo.html#boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_search">u32regex_search</a> | |
94 | </h5> | |
95 | <p> | |
96 | For each <a class="link" href="../../regex_search.html" title="regex_search"><code class="computeroutput"><span class="identifier">regex_search</span></code></a> | |
97 | algorithm defined by <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code>, | |
98 | <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">/</span><span class="identifier">icu</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code> defines an overloaded algorithm that | |
99 | takes the same arguments but is called <code class="computeroutput"><span class="identifier">u32regex_search</span></code>, | |
100 | and accepts UTF-8, UTF-16 or UTF-32 encoded data, as well as | |
101 | an ICU UnicodeString, as input. | |
102 | </p> | |
103 | <p> | |
104 | Example: search for a character sequence in a specific language block: | |
105 | </p> | |
106 | <pre class="programlisting"><span class="identifier">UnicodeString</span> <span class="identifier">extract_greek</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">UnicodeString</span><span class="special">&</span> <span class="identifier">text</span><span class="special">)</span> | |
107 | <span class="special">{</span> | |
108 | <span class="comment">// searches through some UTF-16 encoded text for a block encoded in Greek,</span> | |
109 | <span class="comment">// this expression is imperfect, but the best we can do for now - searching</span> | |
110 | <span class="comment">// for specific scripts is actually pretty hard to do right.</span> | |
111 | <span class="comment">//</span> | |
112 | <span class="comment">// Here we search for a character sequence that begins with a Greek letter,</span> | |
113 | <span class="comment">// and continues with characters that are either not-letters ( [^[:L*:]] )</span> | |
114 | <span class="comment">// or are characters in the Greek character block ( [\\x{370}-\\x{3FF}] ).</span> | |
115 | <span class="comment">//</span> | |
116 | <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex</span> <span class="identifier">r</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">make_u32regex</span><span class="special">(</span> | |
117 | <span class="identifier">L</span><span class="string">"[\\x{370}-\\x{3FF}](?:[^[:L*:]]|[\\x{370}-\\x{3FF}])*"</span><span class="special">);</span> | |
118 | <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u16match</span> <span class="identifier">what</span><span class="special">;</span> | |
119 | <span class="keyword">if</span><span class="special">(</span><span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex_search</span><span class="special">(</span><span class="identifier">text</span><span class="special">,</span> <span class="identifier">what</span><span class="special">,</span> <span class="identifier">r</span><span class="special">))</span> | |
120 | <span class="special">{</span> | |
121 | <span class="comment">// extract $0 as a UnicodeString:</span> | |
122 | <span class="keyword">return</span> <span class="identifier">UnicodeString</span><span class="special">(</span><span class="identifier">what</span><span class="special">[</span><span class="number">0</span><span class="special">].</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">length</span><span class="special">(</span><span class="number">0</span><span class="special">));</span> | |
123 | <span class="special">}</span> | |
124 | <span class="keyword">else</span> | |
125 | <span class="special">{</span> | |
126 | <span class="keyword">throw</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">runtime_error</span><span class="special">(</span><span class="string">"No Greek found!"</span><span class="special">);</span> | |
127 | <span class="special">}</span> | |
128 | <span class="special">}</span> | |
129 | </pre> | |
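<p>
The character class <code class="computeroutput">[\x{370}-\x{3FF}]</code> in the expression above matches by code point value, not by UTF-16 code unit. As a rough illustration in standard C++ only (no Boost or ICU), the following sketch decodes UTF-16 into code points and tests each one against the same range. The helper names are hypothetical, and unpaired surrogates are simply passed through unchanged.
</p>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the code point lies in the range
// U+0370..U+03FF, the same range the expression above spells out.
bool in_greek_block(char32_t cp)
{
   return cp >= 0x370 && cp <= 0x3FF;
}

// Hypothetical helper: decode the UTF-16 code unit(s) starting at index i
// into one code point, and advance i past them.
char32_t next_code_point(const std::u16string& s, std::size_t& i)
{
   char32_t cp = s[i++];
   // A high surrogate followed by a low surrogate encodes one code point
   // above U+FFFF.
   if(cp >= 0xD800 && cp <= 0xDBFF && i < s.size())
   {
      const char32_t low = s[i];
      if(low >= 0xDC00 && low <= 0xDFFF)
      {
         cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
         ++i;
      }
   }
   return cp;
}

// Count the code points in s that fall inside the Greek range.
std::size_t count_greek(const std::u16string& s)
{
   std::size_t n = 0;
   for(std::size_t i = 0; i < s.size();)
   {
      if(in_greek_block(next_code_point(s, i)))
         ++n;
   }
   return n;
}
```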
130 | <h5> | |
131 | <a name="boost_regex.ref.non_std_strings.icu.unicode_algo.h2"></a> | |
132 | <span class="phrase"><a name="boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_replace"></a></span><a class="link" href="unicode_algo.html#boost_regex.ref.non_std_strings.icu.unicode_algo.u32regex_replace">u32regex_replace</a> | |
133 | </h5> | |
134 | <p> | |
135 | For each <a class="link" href="../../regex_replace.html" title="regex_replace"><code class="computeroutput"><span class="identifier">regex_replace</span></code></a> algorithm defined | |
136 | by <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code>, <code class="computeroutput"><span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">/</span><span class="identifier">icu</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code> | |
137 | defines an overloaded algorithm that takes the same arguments but | |
138 | is called <code class="computeroutput"><span class="identifier">u32regex_replace</span></code>, | |
139 | and accepts UTF-8, UTF-16 or UTF-32 encoded data, as well as | |
140 | an ICU UnicodeString, as input. The input sequence and the format string | |
141 | passed to the algorithm can be encoded independently (for | |
142 | example, one can be in UTF-8 and the other in UTF-16), but the result string | |
143 | or output iterator argument must use the same character encoding as the | |
144 | text being searched. | |
145 | </p> | |
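<p>
One way to picture the output-encoding requirement: if the replacement text is assembled as UTF-32 code points, each code point has to be re-encoded into the caller's encoding as it is written out. The sketch below shows that step for UTF-8 in standard C++. It is an illustration under that assumption, not the Boost.Regex implementation, and the names <code class="computeroutput">append_utf8</code> and <code class="computeroutput">to_utf8</code> are hypothetical.
</p>

```cpp
#include <string>

// Hypothetical helper: append one code point to a UTF-8 string.
// Assumes cp is a valid Unicode scalar value (at most 0x10FFFF and not a
// surrogate); no validation is performed.
void append_utf8(std::string& out, char32_t cp)
{
   if(cp < 0x80)
   {
      out += static_cast<char>(cp);
   }
   else if(cp < 0x800)
   {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
   }
   else if(cp < 0x10000)
   {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
   }
   else
   {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
   }
}

// Re-encode a whole UTF-32 sequence as UTF-8.
std::string to_utf8(const std::u32string& s)
{
   std::string out;
   for(char32_t cp : s)
      append_utf8(out, cp);
   return out;
}
```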
146 | <p> | |
147 | Example: Credit card number reformatting: | |
148 | </p> | |
149 | <pre class="programlisting"><span class="comment">//</span> | |
150 | <span class="comment">// Take a credit card number as a string of digits, </span> | |
151 | <span class="comment">// and reformat it as a human readable string with "-"</span> | |
152 | <span class="comment">// separating each group of four digits; </span> | |
153 | <span class="comment">// note that we're mixing a UTF-32 regex, with a UTF-16</span> | |
154 | <span class="comment">// string and a UTF-8 format specifier, and it still all </span> | |
155 | <span class="comment">// just works:</span> | |
156 | <span class="comment">//</span> | |
157 | <span class="keyword">const</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex</span> <span class="identifier">e</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">make_u32regex</span><span class="special">(</span> | |
158 | <span class="string">"\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z"</span><span class="special">);</span> | |
159 | <span class="keyword">const</span> <span class="keyword">char</span><span class="special">*</span> <span class="identifier">human_format</span> <span class="special">=</span> <span class="string">"$1-$2-$3-$4"</span><span class="special">;</span> | |
160 | ||
161 | <span class="identifier">UnicodeString</span> <span class="identifier">human_readable_card_number</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">UnicodeString</span><span class="special">&</span> <span class="identifier">s</span><span class="special">)</span> | |
162 | <span class="special">{</span> | |
163 | <span class="keyword">return</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">u32regex_replace</span><span class="special">(</span><span class="identifier">s</span><span class="special">,</span> <span class="identifier">e</span><span class="special">,</span> <span class="identifier">human_format</span><span class="special">);</span> | |
164 | <span class="special">}</span> | |
165 | </pre> | |
166 | </div> | |
167 | <table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr> | |
168 | <td align="left"></td> | |
169 | <td align="right"><div class="copyright-footer">Copyright © 1998-2013 John Maddock<p> | |
170 | Distributed under the Boost Software License, Version 1.0. (See accompanying | |
171 | file LICENSE_1_0.txt or copy at <a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>) | |
172 | </p> | |
173 | </div></td> | |
174 | </tr></table> | |
175 | <hr> | |
179 | </body> | |
180 | </html> |