Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | [/ |
2 | Copyright 2006-2007 John Maddock. | |
3 | Distributed under the Boost Software License, Version 1.0. | |
4 | (See accompanying file LICENSE_1_0.txt or copy at | |
5 | http://www.boost.org/LICENSE_1_0.txt). | |
6 | ] | |
7 | ||
8 | [section:locale Localization] | |
9 | ||
10 | Boost.Regex provides extensive support for run-time localization. The | |
11 | localization model can be split into two parts: front-end and back-end. | |
12 | ||
13 | Front-end localization deals with everything that the user sees - | |
14 | error messages, and the regular expression syntax itself. For example, a | |
15 | French application could change \[\[:word:\]\] to \[\[:mot:\]\] and \\w to \\m. | |
16 | Modifying the front-end locale requires active support from the developer, | |
17 | who must provide the library with a message catalogue to load, containing the | |
18 | localized strings. The front-end locale is affected by the LC_MESSAGES category only. | |
19 | ||
20 | Back-end localization deals with everything that occurs after the expression | |
21 | has been parsed - in other words everything that the user does not see or | |
22 | interact with directly. It deals with case conversion, collation, and character | |
23 | class membership. The back-end locale does not require any intervention from | |
24 | the developer - the library will acquire all the information it requires for | |
25 | the current locale from the underlying operating system / run time library. | |
26 | This means that if the program user does not interact with regular | |
27 | expressions directly - for example if the expressions are embedded in your | |
28 | C++ code - then no explicit localization is required, as the library will | |
29 | take care of everything for you. For example, embedding the expression | |
30 | \[\[:word:\]\]+ in your code will always match a whole word. If the | |
31 | program is run on a machine with, for example, a Greek locale, then it | |
32 | will still match a whole word, but in Greek characters rather than Latin ones. | |
33 | The back-end locale is affected by the LC_CTYPE and LC_COLLATE categories. | |
34 | ||
35 | There are three separate localization mechanisms supported by Boost.Regex: | |
36 | ||
37 | [h4 Win32 localization model.] | |
38 | ||
39 | This is the default model when the library is compiled under Win32, and is | |
40 | encapsulated by the traits class `w32_regex_traits`. When this model is in | |
41 | effect each [basic_regex] object gets its own LCID. By default this is | |
42 | the user's default setting as returned by GetUserDefaultLCID, but you can | |
43 | call imbue on the `basic_regex` object to set its locale to some other | |
44 | LCID if you wish. All the settings used by Boost.Regex are acquired directly | |
45 | from the operating system, bypassing the C run time library. Front-end | |
46 | localization requires a resource dll, containing a string table with the | |
47 | user-defined strings. The traits class exports the function: | |
48 | ||
49 | static std::string set_message_catalogue(const std::string& s); | |
50 | ||
51 | which needs to be called with a string identifying the name of the resource | |
52 | dll, before your code compiles any regular expressions (but not necessarily | |
53 | before you construct any `basic_regex` instances): | |
54 | ||
55 | boost::w32_regex_traits<char>::set_message_catalogue("mydll.dll"); | |
56 | ||
57 | The library provides full Unicode support under NT. Under Windows 9x | |
58 | the library degrades gracefully: characters 0 to 255 are supported, and the | |
59 | remainder are treated as "unknown" graphic characters. | |
60 | ||
61 | [h4 C localization model.] | |
62 | ||
63 | This model has been deprecated in favor of the C++ localization model for all | |
64 | non-Windows compilers that support it. This model is encapsulated by the traits | |
65 | class `c_regex_traits`. Win32 users can force this model to take effect by | |
66 | defining the pre-processor symbol BOOST_REGEX_USE_C_LOCALE. When this model is | |
67 | in effect there is a single global locale, as set by `setlocale`. All settings | |
68 | are acquired from your run time library; consequently Unicode support is | |
69 | dependent upon your run time library implementation. | |
70 | ||
71 | Front-end localization is not supported. | |
72 | ||
73 | Note that calling `setlocale` invalidates all compiled regular expressions. | |
74 | Calling `setlocale(LC_ALL, "C")` will make this library behave equivalently to | |
75 | most traditional regular expression libraries, including version 1 of this library. | |
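
The single global locale described above can be illustrated with the standard C library alone. The following is a minimal sketch and does not use Boost.Regex: the helper names `use_classic_c_locale` and `leading_alpha` are hypothetical, and the character classification shown is whatever your run time library provides for the locale currently set by `setlocale`.

```cpp
#include <cctype>
#include <clocale>
#include <cstddef>
#include <string>

// Hypothetical helper: switch the process-wide C locale to "C".
// Returns true if the run time library accepted the request.
bool use_classic_c_locale() {
    return std::setlocale(LC_ALL, "C") != nullptr;
}

// Hypothetical helper: count the leading alphabetic characters of s,
// as classified by the current global C locale (LC_CTYPE category).
std::size_t leading_alpha(const std::string& s) {
    std::size_t n = 0;
    while (n < s.size() && std::isalpha(static_cast<unsigned char>(s[n]))) {
        ++n;
    }
    return n;
}
```

Because the locale is global, any change made with `setlocale` affects every later classification in the process, which is why expressions compiled under the C model need recompiling after such a call.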
76 | ||
77 | [h4 C++ localization model.] | |
78 | ||
79 | This model is the default for non-Windows compilers. | |
80 | ||
81 | When this model is in effect each instance of [basic_regex] has its own | |
82 | instance of `std::locale`. Class [basic_regex] also has a member function | |
83 | `imbue` which allows the locale for the expression to be set on a | |
84 | per-instance basis. Front-end localization requires a POSIX message catalogue, | |
85 | which will be loaded via the `std::messages` facet of the expression's locale. | |
86 | The traits class exports the function: | |
87 | ||
88 | static std::string set_message_catalogue(const std::string& s); | |
89 | ||
90 | which needs to be called with a string identifying the name of the | |
91 | message catalogue, before your code compiles any regular expressions | |
92 | (but not necessarily before you construct any `basic_regex` instances): | |
93 | ||
94 | boost::cpp_regex_traits<char>::set_message_catalogue("mycatalogue"); | |
95 | ||
96 | Note that calling `basic_regex<>::imbue` will invalidate any expression | |
97 | currently compiled in that instance of [basic_regex]. | |
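
The ordering constraint above (set the locale first, then compile) can be sketched as follows. To keep the example free of external dependencies it uses `std::regex` from the standard library as a stand-in, not Boost.Regex itself; `std::basic_regex` offers an `imbue` member with comparable behaviour, in that the object no longer holds a usable expression after the call. The helper name `compile_with_locale` is hypothetical.

```cpp
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper: build a regex whose locale is set before the
// pattern is compiled, so that the compiled expression is never
// invalidated by a later imbue() call.
std::regex compile_with_locale(const std::string& pattern,
                               const std::locale& loc) {
    std::regex re;        // default-constructed: nothing compiled yet
    re.imbue(loc);        // per-instance locale, set first
    re.assign(pattern);   // compile under the imbued locale
    return re;
}
```

If the locale has to change later, the expression should be assigned again after the new call to `imbue`.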
98 | ||
99 | Finally, note that if you build the library with a non-default localization model, | |
100 | then the appropriate pre-processor symbol (BOOST_REGEX_USE_C_LOCALE or | |
101 | BOOST_REGEX_USE_CPP_LOCALE) must be defined both when you build the support | |
102 | library, and when you include `<boost/regex.hpp>` or `<boost/cregex.hpp>` | |
103 | in your code. The best way to ensure this is to add the #define to | |
104 | `<boost/regex/user.hpp>`. | |
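
As a sketch of the advice above, the fragment below shows what the addition to `<boost/regex/user.hpp>` might look like when selecting the C++ model. Only the two symbol names come from the text above; which one you define depends on the model you want, and the placement within the header is an assumption.

```cpp
// Added to <boost/regex/user.hpp> so that the library build and every
// translation unit including <boost/regex.hpp> see the same setting.
// Define exactly one of the two symbols.
#define BOOST_REGEX_USE_CPP_LOCALE
// #define BOOST_REGEX_USE_C_LOCALE
```

Keeping the definition in one header avoids the mismatch that can occur when the symbol is passed on the command line for some targets but not others.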
105 | ||
106 | [h4 Providing a message catalogue] | |
107 | ||
108 | In order to localize the front end of the library, you need to provide the | |
109 | library with the appropriate message strings contained either in a resource | |
110 | dll's string table (Win32 model), or a POSIX message catalogue (C++ model). | |
111 | In the latter case the messages must appear in message set zero of the | |
112 | catalogue. The messages and their IDs are as follows: | |
113 | ||
114 | [table | |
115 | [[Message ID][Meaning][Default value]] | |
116 | [[101][The character used to start a sub-expression.]["(" ]] | |
117 | [[102][The character used to end a sub-expression declaration.][")" ]] | |
118 | [[103][The character used to denote an end of line assertion.]["$" ]] | |
119 | [[104][The character used to denote the start of line assertion.]["^" ]] | |
120 | [[105][The character used to denote the "match any character expression".]["." ]] | |
121 | [[106][The match zero or more times repetition operator.]["*" ]] | |
122 | [[107][The match one or more repetition operator.]["+" ]] | |
123 | [[108][The match zero or one repetition operator.]["?" ]] | |
124 | [[109][The character set opening character.]["\[" ]] | |
125 | [[110][The character set closing character.]["\]" ]] | |
126 | [[111][The alternation operator.]["|" ]] | |
127 | [[112][The escape character.]["\\" ]] | |
128 | [[113][The hash character (not currently used).]["#" ]] | |
129 | [[114][The range operator.]["-" ]] | |
130 | [[115][The repetition operator opening character.]["{" ]] | |
131 | [[116][The repetition operator closing character.]["}" ]] | |
132 | [[117][The digit characters.]["0123456789" ]] | |
133 | [[118][The character which when preceded by an escape character represents the word boundary assertion.]["b" ]] | |
134 | [[119][The character which when preceded by an escape character represents the non-word boundary assertion.]["B" ]] | |
135 | [[120][The character which when preceded by an escape character represents the word-start boundary assertion.]["<" ]] | |
136 | [[121][The character which when preceded by an escape character represents the word-end boundary assertion.][">" ]] | |
137 | [[122][The character which when preceded by an escape character represents any word character.]["w" ]] | |
138 | [[123][The character which when preceded by an escape character represents a non-word character.]["W" ]] | |
139 | [[124][The character which when preceded by an escape character represents a start of buffer assertion.]["`A" ]] | |
140 | [[125][The character which when preceded by an escape character represents an end of buffer assertion.]["'z" ]] | |
141 | [[126][The newline character. ]["\\n" ]] | |
142 | [[127][The comma separator.]["," ]] | |
143 | [[128][The character which when preceded by an escape character represents the bell character.]["a" ]] | |
144 | [[129][The character which when preceded by an escape character represents the form feed character.]["f" ]] | |
145 | [[130][The character which when preceded by an escape character represents the newline character.]["n" ]] | |
146 | [[131][The character which when preceded by an escape character represents the carriage return character.]["r" ]] | |
147 | [[132][The character which when preceded by an escape character represents the tab character.]["t" ]] | |
148 | [[133][The character which when preceded by an escape character represents the vertical tab character.]["v" ]] | |
149 | [[134][The character which when preceded by an escape character represents the start of a hexadecimal character constant.]["x" ]] | |
150 | [[135][The character which when preceded by an escape character represents the start of an ASCII escape character.]["c" ]] | |
151 | [[136][The colon character.][":" ]] | |
152 | [[137][The equals character.]["=" ]] | |
153 | [[138][The character which when preceded by an escape character represents the ASCII escape character.]["e" ]] | |
154 | [[139][The character which when preceded by an escape character represents any lower case character.]["l" ]] | |
155 | [[140][The character which when preceded by an escape character represents any non-lower case character.]["L" ]] | |
156 | [[141][The character which when preceded by an escape character represents any upper case character.]["u" ]] | |
157 | [[142][The character which when preceded by an escape character represents any non-upper case character.]["U" ]] | |
158 | [[143][The character which when preceded by an escape character represents any space character.]["s" ]] | |
159 | [[144][The character which when preceded by an escape character represents any non-space character.]["S" ]] | |
160 | [[145][The character which when preceded by an escape character represents any digit character.]["d" ]] | |
161 | [[146][The character which when preceded by an escape character represents any non-digit character.]["D" ]] | |
162 | [[147][The character which when preceded by an escape character represents the end quote operator.]["E" ]] | |
163 | [[148][The character which when preceded by an escape character represents the start quote operator.]["Q" ]] | |
164 | [[149][The character which when preceded by an escape character represents a Unicode combining character sequence.]["X" ]] | |
165 | [[150][The character which when preceded by an escape character represents any single character.]["C" ]] | |
166 | [[151][The character which when preceded by an escape character represents end of buffer operator.]["Z" ]] | |
167 | [[152][The character which when preceded by an escape character represents the continuation assertion.]["G" ]] | |
168 | [[153][The character which when preceded by (? indicates a zero width negated forward lookahead assert.]["!" ]] | |
169 | ] | |
170 | ||
171 | Custom error messages are loaded as follows: | |
172 | ||
173 | [table | |
174 | [[Message ID][Error message ID][Default string ]] | |
175 | [[201][REG_NOMATCH]["No match" ]] | |
176 | [[202][REG_BADPAT]["Invalid regular expression" ]] | |
177 | [[203][REG_ECOLLATE]["Invalid collation character" ]] | |
178 | [[204][REG_ECTYPE]["Invalid character class name" ]] | |
179 | [[205][REG_EESCAPE]["Trailing backslash" ]] | |
180 | [[206][REG_ESUBREG]["Invalid back reference" ]] | |
181 | [[207][REG_EBRACK]["Unmatched \[ or \[^" ]] | |
182 | [[208][REG_EPAREN]["Unmatched ( or \\(" ]] | |
183 | [[209][REG_EBRACE]["Unmatched \\{" ]] | |
184 | [[210][REG_BADBR]["Invalid content of \\{\\}" ]] | |
185 | [[211][REG_ERANGE]["Invalid range end" ]] | |
186 | [[212][REG_ESPACE]["Memory exhausted" ]] | |
187 | [[213][REG_BADRPT]["Invalid preceding regular expression" ]] | |
188 | [[214][REG_EEND]["Premature end of regular expression" ]] | |
189 | [[215][REG_ESIZE]["Regular expression too big" ]] | |
190 | [[216][REG_ERPAREN]["Unmatched ) or \\)" ]] | |
191 | [[217][REG_EMPTY]["Empty expression" ]] | |
192 | [[218][REG_E_UNKNOWN]["Unknown error" ]] | |
193 | ] | |
194 | ||
195 | Custom character class names are loaded as follows: | |
196 | ||
197 | [table | |
198 | [[Message ID][Description][Equivalent default class name ]] | |
199 | [[300][The character class name for alphanumeric characters.]["alnum" ]] | |
200 | [[301][The character class name for alphabetic characters.]["alpha" ]] | |
201 | [[302][The character class name for control characters.]["cntrl" ]] | |
202 | [[303][The character class name for digit characters.]["digit" ]] | |
203 | [[304][The character class name for graphics characters.]["graph" ]] | |
204 | [[305][The character class name for lower case characters.]["lower" ]] | |
205 | [[306][The character class name for printable characters.]["print" ]] | |
206 | [[307][The character class name for punctuation characters.]["punct" ]] | |
207 | [[308][The character class name for space characters.]["space" ]] | |
208 | [[309][The character class name for upper case characters.]["upper" ]] | |
209 | [[310][The character class name for hexadecimal characters.]["xdigit" ]] | |
210 | [[311][The character class name for blank characters.]["blank" ]] | |
211 | [[312][The character class name for word characters.]["word" ]] | |
212 | [[313][The character class name for Unicode characters.]["unicode" ]] | |
213 | ] | |
214 | ||
215 | Finally, custom collating element names are loaded starting from message | |
216 | id 400, and terminating when the first load thereafter fails. Each message | |
217 | looks something like: "tagname string" where tagname is the name used | |
218 | inside [[.tagname.]] and string is the actual text of the collating element. | |
219 | Note that the value of collating element [[.zero.]] is used for the | |
220 | conversion of strings to numbers - if you replace this with another value then | |
221 | that will be used for string parsing - for example use the Unicode | |
222 | character 0x0660 for [[.zero.]] if you want to use Unicode Arabic-Indic | |
223 | digits in your regular expressions in place of Latin digits. | |
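
The "tagname string" message layout described above can be illustrated with a small parsing sketch. This is not the library's own parser: the helper name `split_collating_message` is hypothetical, and it assumes that the tag name and the element text are separated by the first space in the message.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: split a collating element message of the form
// "tagname string" into (tagname, string) at the first space.
// A message with no space yields the whole text as the tag name and
// an empty element string.
std::pair<std::string, std::string>
split_collating_message(const std::string& msg) {
    const std::string::size_type pos = msg.find(' ');
    if (pos == std::string::npos) {
        return std::make_pair(msg, std::string());
    }
    return std::make_pair(msg.substr(0, pos), msg.substr(pos + 1));
}
```

For example, a message with the text `zero 0` would yield the tag name `zero` and the element text `0` under these assumptions.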
224 | ||
225 | Note that the POSIX defined names for character classes and collating elements | |
226 | are always available, even if custom names are defined. In contrast, | |
227 | custom error messages and custom syntax messages replace the default ones. | |
228 | ||
229 | [endsect] | |
230 | ||
231 |