Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | [/ |
2 | Copyright 2006-2007 John Maddock. | |
3 | Distributed under the Boost Software License, Version 1.0. | |
4 | (See accompanying file LICENSE_1_0.txt or copy at | |
5 | http://www.boost.org/LICENSE_1_0.txt). | |
6 | ] | |
7 | ||
8 | ||
9 | [section:unicode Unicode and Boost.Regex] | |
10 | ||
11 | There are two ways to use Boost.Regex with Unicode strings: | |
12 | ||
13 | [h4 Rely on wchar_t] | |
14 | ||
15 | If your platform's `wchar_t` type can hold Unicode strings, and your | |
16 | platform's C/C++ runtime correctly handles wide character constants | |
17 | (when passed to `std::iswspace`, `std::iswlower`, etc.), then you can use | |
18 | `boost::wregex` to process Unicode. However, there are several | |
19 | disadvantages to this approach: | |
20 | ||
21 | * It's not portable: there's no guarantee on the width of `wchar_t`, or | |
22 | even whether the runtime treats wide characters as Unicode at all; | |
23 | most Windows compilers do so, but many Unix systems do not. | |
24 | * There's no support for Unicode-specific character classes such as `[[:Nd:]]` and `[[:Po:]]`. | |
25 | * You can only search strings that are encoded as sequences of wide | |
26 | characters; it is not possible to search UTF-8, or even UTF-16 on many platforms. | |
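
The portability concern about the width of `wchar_t` can be checked at compile time. The following is a minimal sketch using only the standard library; the helper names `wchar_bits` and `wchar_holds_any_code_point` are hypothetical and are not part of Boost.Regex.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <limits>    // std::numeric_limits

// Hypothetical helper: number of bits in this platform's wchar_t.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true if a single wchar_t can represent every
// Unicode code point (the highest code point is U+10FFFF).
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long long>(
               std::numeric_limits<wchar_t>::max()) >= 0x10FFFFull;
}
```

Where `wchar_holds_any_code_point()` is false, a wide string has to use a multi-unit encoding such as UTF-16, and a `boost::wregex` sees individual code units rather than whole characters.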
27 | ||
28 | [h4 Use a Unicode Aware Regular Expression Type.] | |
29 | ||
30 | If you have the | |
31 | [@http://www.ibm.com/software/globalization/icu/ ICU library], then | |
32 | Boost.Regex can be | |
33 | [link boost_regex.install.building_with_unicode_and_icu_su | |
34 | configured to make use | |
35 | of it], providing a distinct regular expression type (`boost::u32regex`) | |
36 | that supports both Unicode-specific character properties and the searching | |
37 | of text that is encoded in UTF-8, UTF-16, or UTF-32. See: | |
38 | [link boost_regex.ref.non_std_strings.icu | |
39 | ICU string class support]. | |
40 | ||
41 | [endsect] | |
42 |