[/
  Copyright 2006-2007 John Maddock.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]

[section:icu Working With Unicode and ICU String Types]

[section:intro Introduction to using Regex with ICU]

The header `<boost/regex/icu.hpp>`
contains the data types and algorithms necessary for working with regular
expressions in a Unicode-aware environment.

In order to use this header you will need the
[@http://www.ibm.com/software/globalization/icu/ ICU library], and you will need
to have built the Boost.Regex library with
[link boost_regex.install.building_with_unicode_and_icu_su ICU support enabled].

The header will enable you to:

* Create regular expressions that treat Unicode strings as sequences of UTF-32 code points.
* Create regular expressions that support various Unicode data properties, including character classification.
* Transparently search Unicode strings that are encoded as UTF-8, UTF-16 or UTF-32.

[endsect]

[section:unicode_types Unicode regular expression types]

Header `<boost/regex/icu.hpp>` provides a regular expression traits class that
handles UTF-32 characters:

   class icu_regex_traits;

and a regular expression type based upon that:

   typedef basic_regex<UChar32,icu_regex_traits> u32regex;

The type `u32regex` is the regular expression type to use for all Unicode
regular expressions; internally it uses UTF-32 code points, but it can be
created from, and used to search, UTF-8 or UTF-16 encoded strings
as well as UTF-32 ones.

The constructors and assign member functions of `u32regex` require UTF-32
encoded strings, but there is a series of overloaded algorithms called
`make_u32regex` which allow regular expressions to be created from
UTF-8, UTF-16, or UTF-32 encoded strings:

   template <class InputIterator>
   u32regex make_u32regex(InputIterator i,
                          InputIterator j,
                          boost::regex_constants::syntax_option_type opt);

[*Effects]: Creates a regular expression object from the iterator sequence \[i,j).
The character encoding of the sequence is determined based upon `sizeof(*i)`:
1 implies UTF-8, 2 implies UTF-16, and 4 implies UTF-32.
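
The size-based rule above can be illustrated with a small standalone sketch.
It uses only the standard library and is not how Boost.Regex is implemented;
the names `utf_encoding`, `encoding_for_size` and `encoding_of` are hypothetical
and exist only for this illustration:

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical enumeration naming the encodings discussed in the text.
enum class utf_encoding { utf8, utf16, utf32, unsupported };

// Maps the size of one code unit to the encoding it implies:
// 1 implies UTF-8, 2 implies UTF-16, and 4 implies UTF-32.
constexpr utf_encoding encoding_for_size(std::size_t code_unit_size)
{
   return code_unit_size == 1 ? utf_encoding::utf8
        : code_unit_size == 2 ? utf_encoding::utf16
        : code_unit_size == 4 ? utf_encoding::utf32
        : utf_encoding::unsupported;
}

// Deduces the encoding from the iterator's value type; this mirrors
// the idea of inspecting sizeof(*i), it does not look at the data.
template <class InputIterator>
utf_encoding encoding_of(InputIterator)
{
   typedef typename std::iterator_traits<InputIterator>::value_type value_type;
   return encoding_for_size(sizeof(value_type));
}
```

Note that the choice depends only on the element type, so passing a `char`
sequence that is not actually UTF-8 would still be treated as UTF-8.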

   u32regex make_u32regex(const char* p,
                          boost::regex_constants::syntax_option_type opt
                             = boost::regex_constants::perl);

[*Effects]: Creates a regular expression object from the null-terminated
UTF-8 character sequence /p/.

   u32regex make_u32regex(const unsigned char* p,
                          boost::regex_constants::syntax_option_type opt
                             = boost::regex_constants::perl);

[*Effects]: Creates a regular expression object from the null-terminated UTF-8 character sequence /p/.

   u32regex make_u32regex(const wchar_t* p,
                          boost::regex_constants::syntax_option_type opt
                             = boost::regex_constants::perl);

[*Effects]: Creates a regular expression object from the null-terminated character sequence /p/. The character encoding of the sequence is determined based upon `sizeof(wchar_t)`: 1 implies UTF-8, 2 implies UTF-16, and 4 implies UTF-32.

   u32regex make_u32regex(const UChar* p,
                          boost::regex_constants::syntax_option_type opt
                             = boost::regex_constants::perl);

[*Effects]: Creates a regular expression object from the null-terminated UTF-16 character sequence /p/.

   template<class C, class T, class A>
   u32regex make_u32regex(const std::basic_string<C, T, A>& s,
                          boost::regex_constants::syntax_option_type opt
                             = boost::regex_constants::perl);

[*Effects]: Creates a regular expression object from the string /s/. The character encoding of the string is determined based upon `sizeof(C)`: 1 implies UTF-8, 2 implies UTF-16, and 4 implies UTF-32.

   u32regex make_u32regex(const UnicodeString& s,
                          boost::regex_constants::syntax_option_type opt
                             = boost::regex_constants::perl);

[*Effects]: Creates a regular expression object from the UTF-16 encoded string /s/.

[endsect]

[section:unicode_algo Unicode Regular Expression Algorithms]

The regular expression algorithms [regex_match], [regex_search] and [regex_replace]
all expect that the character sequence upon which they operate
is encoded in the same character encoding as the regular expression object
with which they are used. For Unicode regular expressions that behavior is
undesirable: while we may want to process the data in UTF-32 "chunks", the
actual data is much more likely to be encoded as either UTF-8 or UTF-16.
Therefore the header `<boost/regex/icu.hpp>` provides a series of thin wrappers
around these algorithms, called `u32regex_match`, `u32regex_search`, and
`u32regex_replace`. These wrappers use iterator adapters internally to
make external UTF-8 or UTF-16 data look as though it's really a UTF-32 sequence
that can then be passed on to the "real" algorithm.
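
To give a feel for what such an adapter has to do, here is a minimal standalone
sketch that decodes a UTF-8 `std::string` into UTF-32 code points. It is an
eager conversion written for illustration only, whereas the library's adapters
work lazily on iterators; the function name `decode_utf8` is hypothetical, and
the sketch assumes well-formed input apart from the checks shown:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative only: decodes UTF-8 into UTF-32 code points.
// Overlong forms and surrogate code points are not rejected here.
std::u32string decode_utf8(const std::string& in)
{
   std::u32string out;
   std::size_t i = 0;
   while(i < in.size())
   {
      unsigned char lead = static_cast<unsigned char>(in[i]);
      std::size_t extra;   // number of continuation bytes that follow
      char32_t cp;         // code point being assembled
      if(lead < 0x80)                { extra = 0; cp = lead; }
      else if((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
      else if((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
      else if((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
      else throw std::runtime_error("invalid UTF-8 lead byte");

      // The lead byte plus 'extra' continuation bytes must be available.
      if(in.size() - i <= extra)
         throw std::runtime_error("truncated UTF-8 sequence");
      for(std::size_t k = 1; k <= extra; ++k)
      {
         unsigned char c = static_cast<unsigned char>(in[i + k]);
         if((c & 0xC0) != 0x80)
            throw std::runtime_error("invalid UTF-8 continuation byte");
         cp = (cp << 6) | (c & 0x3F);   // append six payload bits
      }
      out.push_back(cp);
      i += extra + 1;
   }
   return out;
}
```

An iterator adapter performs the same decoding one code point at a time as it
is dereferenced, which is why the wrapped algorithm can operate on the original
data without a separate converted copy.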

[h4 u32regex_match]

For each [regex_match] algorithm defined by `<boost/regex.hpp>`,
`<boost/regex/icu.hpp>` defines an overloaded algorithm that takes the
same arguments, but which is called `u32regex_match`, and which will
accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an
ICU `UnicodeString` as input.

Example: match a password, encoded in a UTF-16 `UnicodeString`:

   //
   // Find out if *password* meets our password requirements,
   // as defined by the regular expression *requirements*.
   //
   bool is_valid_password(const UnicodeString& password, const UnicodeString& requirements)
   {
      return boost::u32regex_match(password, boost::make_u32regex(requirements));
   }

Example: match a UTF-8 encoded filename:

   // Extract filename part of a path from a UTF-8 encoded std::string and return the result
   // as another std::string:
   std::string get_filename(const std::string& path)
   {
      boost::u32regex r = boost::make_u32regex("(?:\\A|.*\\\\)([^\\\\]+)");
      boost::smatch what;
      if(boost::u32regex_match(path, what, r))
      {
         // extract $1 as a std::string:
         return what.str(1);
      }
      throw std::runtime_error("Invalid pathname");
   }

[h4 u32regex_search]

For each [regex_search] algorithm defined by `<boost/regex.hpp>`,
`<boost/regex/icu.hpp>` defines an overloaded algorithm that takes the
same arguments, but which is called `u32regex_search`, and which will
accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an ICU
`UnicodeString` as input.

Example: search for a character sequence in a specific language block:

   UnicodeString extract_greek(const UnicodeString& text)
   {
      // searches through some UTF-16 encoded text for a block encoded in Greek,
      // this expression is imperfect, but the best we can do for now - searching
      // for specific scripts is actually pretty hard to do right.
      //
      // Here we search for a character sequence that begins with a Greek letter,
      // and continues with characters that are either not-letters ( [^[:L*:]] )
      // or are characters in the Greek character block ( [\\x{370}-\\x{3FF}] ).
      //
      boost::u32regex r = boost::make_u32regex(
            L"[\\x{370}-\\x{3FF}](?:[^[:L*:]]|[\\x{370}-\\x{3FF}])*");
      boost::u16match what;
      if(boost::u32regex_search(text, what, r))
      {
         // extract $0 as a UnicodeString:
         return UnicodeString(what[0].first, what.length(0));
      }
      throw std::runtime_error("No Greek found!");
   }

[h4 u32regex_replace]

For each [regex_replace] algorithm defined by `<boost/regex.hpp>`,
`<boost/regex/icu.hpp>` defines an overloaded algorithm that takes
the same arguments, but which is called `u32regex_replace`, and which will
accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an ICU
`UnicodeString` as input. The input sequence and the format string specifier
passed to the algorithm can be encoded independently (for example one can
be in UTF-8, the other in UTF-16), but the result string / output iterator
argument must use the same character encoding as the text being searched.

Example: Credit card number reformatting:

   //
   // Take a credit card number as a string of digits,
   // and reformat it as a human readable string with "-"
   // separating each group of four digits;
   // note that we're mixing a UTF-32 regex with a UTF-16
   // string and a UTF-8 format specifier, and it all still
   // works:
   //
   const boost::u32regex e = boost::make_u32regex(
         "\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
   const char* human_format = "$1-$2-$3-$4";

   UnicodeString human_readable_card_number(const UnicodeString& s)
   {
      return boost::u32regex_replace(s, e, human_format);
   }

[endsect]

[section:unicode_iter Unicode Aware Regex Iterators]

[h4 u32regex_iterator]

Type `u32regex_iterator` is in all respects the same as [regex_iterator]
except that since the regular expression type is always `u32regex`
it only takes one template parameter (the iterator type). It also calls
`u32regex_search` internally, allowing it to interface correctly with
UTF-8, UTF-16, and UTF-32 data:

   template <class BidirectionalIterator>
   class u32regex_iterator
   {
      // for members see regex_iterator
   };

   typedef u32regex_iterator<const char*>     utf8regex_iterator;
   typedef u32regex_iterator<const UChar*>    utf16regex_iterator;
   typedef u32regex_iterator<const UChar32*>  utf32regex_iterator;

In order to simplify the construction of a `u32regex_iterator` from a string,
there is a series of non-member helper functions called `make_u32regex_iterator`:

   u32regex_iterator<const char*>
      make_u32regex_iterator(const char* s,
                             const u32regex& e,
                             regex_constants::match_flag_type m = regex_constants::match_default);

   u32regex_iterator<const wchar_t*>
      make_u32regex_iterator(const wchar_t* s,
                             const u32regex& e,
                             regex_constants::match_flag_type m = regex_constants::match_default);

   u32regex_iterator<const UChar*>
      make_u32regex_iterator(const UChar* s,
                             const u32regex& e,
                             regex_constants::match_flag_type m = regex_constants::match_default);

   template <class charT, class Traits, class Alloc>
   u32regex_iterator<typename std::basic_string<charT, Traits, Alloc>::const_iterator>
      make_u32regex_iterator(const std::basic_string<charT, Traits, Alloc>& s,
                             const u32regex& e,
                             regex_constants::match_flag_type m = regex_constants::match_default);

   u32regex_iterator<const UChar*>
      make_u32regex_iterator(const UnicodeString& s,
                             const u32regex& e,
                             regex_constants::match_flag_type m = regex_constants::match_default);

Each of these overloads returns an iterator that enumerates all occurrences
of expression /e/ in text /s/, using match_flags /m/.
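
The enumeration idiom is the usual one for regex iterators: advance until the
iterator compares equal to a default-constructed end iterator. The sketch below
shows that idiom with the standard library's `std::sregex_iterator` rather than
`u32regex_iterator`, purely so that it compiles without ICU; it is an analogy,
not the Unicode-aware interface, and the helper name `all_matches` is hypothetical:

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects the text of every match of e in text, in order of occurrence.
std::vector<std::string> all_matches(const std::string& text, const std::regex& e)
{
   std::vector<std::string> result;
   // i walks the matches; the default-constructed j marks the end of the sequence.
   std::sregex_iterator i(text.begin(), text.end(), e), j;
   for(; i != j; ++i)
      result.push_back(i->str());   // $0 of the current match
   return result;
}
```

For example, `all_matches("a1 b22 c333", std::regex("[0-9]+"))` collects the
three digit runs in the order they occur.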

Example: search for international currency symbols, along with their associated numeric value:

   void enumerate_currencies(const std::string& text)
   {
      // enumerate and print all the currency symbols, along
      // with any associated numeric values:
      const char* re =
         "([[:Sc:]][[:Cf:][:Cc:][:Z*:]]*)?"
         "([[:Nd:]]+(?:[[:Po:]][[:Nd:]]+)?)?"
         "(?(1)"
            "|(?(2)"
               "[[:Cf:][:Cc:][:Z*:]]*"
            ")"
            "[[:Sc:]]"
         ")";
      boost::u32regex r = boost::make_u32regex(re);
      boost::u32regex_iterator<std::string::const_iterator>
            i(boost::make_u32regex_iterator(text, r)), j;
      for(; i != j; ++i)
         std::cout << (*i)[0] << std::endl;
   }

Calling

[/this doesn't format correctly as code:]
[pre enumerate_currencies(" $100.23 or '''£'''198.12 ");]

Yields

[pre
$100.23
'''£'''198.12
]

Provided of course that the input is encoded as UTF-8.

[h4 u32regex_token_iterator]

Type `u32regex_token_iterator` is in all respects the same as [regex_token_iterator]
except that since the regular expression type is always `u32regex` it only
takes one template parameter (the iterator type). It also calls
`u32regex_search` internally, allowing it to interface correctly with UTF-8,
UTF-16, and UTF-32 data:

   template <class BidirectionalIterator>
   class u32regex_token_iterator
   {
      // for members see regex_token_iterator
   };

   typedef u32regex_token_iterator<const char*>     utf8regex_token_iterator;
   typedef u32regex_token_iterator<const UChar*>    utf16regex_token_iterator;
   typedef u32regex_token_iterator<const UChar32*>  utf32regex_token_iterator;

In order to simplify the construction of a `u32regex_token_iterator` from a string,
there is a series of non-member helper functions called `make_u32regex_token_iterator`:

   u32regex_token_iterator<const char*>
      make_u32regex_token_iterator(
            const char* s,
            const u32regex& e,
            int sub = 0,
            regex_constants::match_flag_type m = regex_constants::match_default);

   u32regex_token_iterator<const wchar_t*>
      make_u32regex_token_iterator(
            const wchar_t* s,
            const u32regex& e,
            int sub = 0,
            regex_constants::match_flag_type m = regex_constants::match_default);

   u32regex_token_iterator<const UChar*>
      make_u32regex_token_iterator(
            const UChar* s,
            const u32regex& e,
            int sub = 0,
            regex_constants::match_flag_type m = regex_constants::match_default);

   template <class charT, class Traits, class Alloc>
   u32regex_token_iterator<typename std::basic_string<charT, Traits, Alloc>::const_iterator>
      make_u32regex_token_iterator(
            const std::basic_string<charT, Traits, Alloc>& s,
            const u32regex& e,
            int sub = 0,
            regex_constants::match_flag_type m = regex_constants::match_default);

   u32regex_token_iterator<const UChar*>
      make_u32regex_token_iterator(
            const UnicodeString& s,
            const u32regex& e,
            int sub = 0,
            regex_constants::match_flag_type m = regex_constants::match_default);

Each of these overloads returns an iterator that enumerates all occurrences of
marked sub-expression /sub/ in regular expression /e/, found in text /s/, using
match_flags /m/.

   template <std::size_t N>
   u32regex_token_iterator<const char*>
      make_u32regex_token_iterator(
            const char* s,
            const u32regex& e,
            const int (&submatch)[N],
            regex_constants::match_flag_type m = regex_constants::match_default);

   template <std::size_t N>
   u32regex_token_iterator<const wchar_t*>
      make_u32regex_token_iterator(
            const wchar_t* s,
            const u32regex& e,
            const int (&submatch)[N],
            regex_constants::match_flag_type m = regex_constants::match_default);

   template <std::size_t N>
   u32regex_token_iterator<const UChar*>
      make_u32regex_token_iterator(
            const UChar* s,
            const u32regex& e,
            const int (&submatch)[N],
            regex_constants::match_flag_type m = regex_constants::match_default);

   template <class charT, class Traits, class Alloc, std::size_t N>
   u32regex_token_iterator<typename std::basic_string<charT, Traits, Alloc>::const_iterator>
      make_u32regex_token_iterator(
            const std::basic_string<charT, Traits, Alloc>& s,
            const u32regex& e,
            const int (&submatch)[N],
            regex_constants::match_flag_type m = regex_constants::match_default);

   template <std::size_t N>
   u32regex_token_iterator<const UChar*>
      make_u32regex_token_iterator(
            const UnicodeString& s,
            const u32regex& e,
            const int (&submatch)[N],
            regex_constants::match_flag_type m = regex_constants::match_default);

Each of these overloads returns an iterator that enumerates, for every match of
regular expression /e/ found in text /s/, one sub-expression for each index in
/submatch/, using match_flags /m/.

   u32regex_token_iterator<const char*>
      make_u32regex_token_iterator(
            const char* s,
            const u32regex& e,
            const std::vector<int>& submatch,
            regex_constants::match_flag_type m = regex_constants::match_default);

   u32regex_token_iterator<const wchar_t*>
      make_u32regex_token_iterator(
            const wchar_t* s,
            const u32regex& e,
            const std::vector<int>& submatch,
            regex_constants::match_flag_type m = regex_constants::match_default);

   u32regex_token_iterator<const UChar*>
      make_u32regex_token_iterator(
            const UChar* s,
            const u32regex& e,
            const std::vector<int>& submatch,
            regex_constants::match_flag_type m = regex_constants::match_default);

   template <class charT, class Traits, class Alloc>
   u32regex_token_iterator<typename std::basic_string<charT, Traits, Alloc>::const_iterator>
      make_u32regex_token_iterator(
            const std::basic_string<charT, Traits, Alloc>& s,
            const u32regex& e,
            const std::vector<int>& submatch,
            regex_constants::match_flag_type m = regex_constants::match_default);

   u32regex_token_iterator<const UChar*>
      make_u32regex_token_iterator(
            const UnicodeString& s,
            const u32regex& e,
            const std::vector<int>& submatch,
            regex_constants::match_flag_type m = regex_constants::match_default);

Each of these overloads returns an iterator that enumerates, for every match of
regular expression /e/ found in text /s/, one sub-expression for each index in
/submatch/, using match_flags /m/.
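
When several sub-expression indexes are requested, the tokens arrive
interleaved: all requested sub-expressions of the first match, then those of
the second match, and so on. The sketch below demonstrates that ordering with
the standard library's `std::sregex_token_iterator`, used here only as a
stand-in that compiles without ICU; it is not the `u32regex_token_iterator`
interface itself, and the function name `key_value_tokens` is hypothetical:

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits "key=value" pairs into a flat list: key, value, key, value, ...
std::vector<std::string> key_value_tokens(const std::string& text)
{
   static const std::regex e("([a-z]+)=([0-9]+)");
   // Request sub-expressions 1 and 2 of every match, in that order.
   const std::vector<int> submatch = {1, 2};
   std::vector<std::string> result;
   std::sregex_token_iterator i(text.begin(), text.end(), e, submatch), j;
   for(; i != j; ++i)
      result.push_back(i->str());
   return result;
}
```

Calling `key_value_tokens("a=1 bb=22")` produces the four tokens `a`, `1`,
`bb` and `22`, in that order.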

Example: search for international currency symbols, along with their associated numeric value:

   void enumerate_currencies2(const std::string& text)
   {
      // enumerate and print all the currency symbols, along
      // with any associated numeric values:
      const char* re =
         "([[:Sc:]][[:Cf:][:Cc:][:Z*:]]*)?"
         "([[:Nd:]]+(?:[[:Po:]][[:Nd:]]+)?)?"
         "(?(1)"
            "|(?(2)"
               "[[:Cf:][:Cc:][:Z*:]]*"
            ")"
            "[[:Sc:]]"
         ")";
      boost::u32regex r = boost::make_u32regex(re);
      boost::u32regex_token_iterator<std::string::const_iterator>
         i(boost::make_u32regex_token_iterator(text, r, 1)), j;
      for(; i != j; ++i)
         std::cout << *i << std::endl;
   }

[endsect]

[endsect]