# Unicode conformance

This document describes the regex crate's conformance to Unicode's
[UTS#18](http://unicode.org/reports/tr18/)
report, which lays out 3 levels of support: Basic, Extended and Tailored.

Full support for Level 1 ("Basic Unicode Support") is provided with two
exceptions:

1. Line boundaries are not Unicode aware. Namely, only the `\n`
   (`END OF LINE`) character is recognized as a line boundary.
2. The compatibility properties specified by
   [RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)
   are ASCII-only definitions.

Little to no support is provided for either Level 2 or Level 3. For the most
part, this is because the features are either complex to implement or, at
the very least, very difficult to implement without sacrificing performance.
For example, tackling canonical equivalence such that matching worked as one
would expect regardless of normalization form would be a significant
undertaking. This is at least partially a result of the fact that this regex
engine is based on finite automata, which admit less of the flexibility
normally associated with backtracking implementations.


## RL1.1 Hex Notation

[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)

Hex Notation refers to the ability to specify a Unicode code point in a regular
expression via its hexadecimal code point representation. This is useful in
environments that have poor Unicode font rendering or if you need to express a
code point that is not normally displayable. All forms of hexadecimal notation
are supported:

    \x7F        hex character code (exactly two digits)
    \x{10FFFF}  any hex character code corresponding to a Unicode code point
    \u007F      hex character code (exactly four digits)
    \u{7F}      any hex character code corresponding to a Unicode code point
    \U0000007F  hex character code (exactly eight digits)
    \U{7F}      any hex character code corresponding to a Unicode code point

Briefly, `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent ways
of expressing hexadecimal code points. Any number of digits can be written
within the braces. In contrast, `\xNN`, `\uNNNN` and `\UNNNNNNNN` are all
fixed-width variants of the same idea.

Note that when Unicode mode is disabled, any non-ASCII Unicode code point is
banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
mode is disabled. That is, the regex `\xFF` matches the Unicode code point
U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
the literal byte `\xFF`.


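To make the six notations above concrete, here is a minimal sketch that decodes them using only the Rust standard library. The helper `parse_hex_escape` is hypothetical and written for illustration; it is not the regex crate's parser and handles a single escape in isolation.

```rust
/// Hypothetical helper (not part of the regex crate): decode one hex escape,
/// written in any of the six forms listed above, into a `char`.
fn parse_hex_escape(escape: &str) -> Option<char> {
    let rest = escape.strip_prefix('\\')?;
    let mut chars = rest.chars();
    // The letter after the backslash selects the fixed width.
    let width = match chars.next()? {
        'x' => 2,
        'u' => 4,
        'U' => 8,
        _ => return None,
    };
    let body = chars.as_str();
    let digits = if let Some(inner) = body.strip_prefix('{') {
        // Braced form: any number of hex digits.
        inner.strip_suffix('}')?
    } else if body.len() == width {
        // Fixed-width form: exactly 2, 4 or 8 digits.
        body
    } else {
        return None;
    };
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    let value = u32::from_str_radix(digits, 16).ok()?;
    // `char::from_u32` rejects surrogates and values above U+10FFFF.
    char::from_u32(value)
}

fn main() {
    // All six spellings of U+007F decode to the same code point.
    let spellings = [r"\x7F", r"\u007F", r"\U0000007F", r"\x{7F}", r"\u{7F}", r"\U{7F}"];
    for escape in spellings.iter() {
        assert_eq!(parse_hex_escape(escape), Some('\u{7F}'));
    }
    assert_eq!(parse_hex_escape(r"\x{10FFFF}"), Some('\u{10FFFF}'));

    // As a code point, U+00FF occupies two bytes in UTF-8, which is
    // different from the single raw byte 0xFF.
    let mut buf = [0u8; 4];
    let encoded = parse_hex_escape(r"\xFF").unwrap().encode_utf8(&mut buf);
    assert_eq!(encoded.as_bytes(), &[0xC3u8, 0xBF]);
    println!("U+00FF in UTF-8: {:02X?}", encoded.as_bytes());
}
```

The last assertion mirrors the distinction drawn above between `\xFF` in Unicode mode (the two-byte encoding of U+00FF) and `(?-u)\xFF` (one raw byte).
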
## RL1.2 Properties

[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)

Full support for Unicode property syntax is provided. Unicode properties
provide a convenient way to construct character classes from groups of code
points specified by Unicode. The regex crate does not support every property,
but covers a useful subset. In particular:

* [General categories](http://unicode.org/reports/tr18/#General_Category_Property)
* [Scripts and Script Extensions](http://unicode.org/reports/tr18/#Script_Property)
* [Age](http://unicode.org/reports/tr18/#Age)
* A smattering of boolean properties, including all of those specified by
  [RL1.2](http://unicode.org/reports/tr18/#RL1.2) explicitly.

In all cases, property name and value abbreviations are supported, and all
names/values are matched loosely without regard for case, whitespace or
underscores. Property name aliases can be found in Unicode's
[`PropertyAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
file, while property value aliases can be found in Unicode's
[`PropertyValueAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
file.

The syntax supported is also consistent with the UTS#18 recommendation:

* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
  `\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
  `\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and
  `Script_Extensions` (or `scx` for short).
* `\p{age:3.2}` selects all code points in Unicode 3.2.
* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
  via `\p{alpha}` (for example).
* Single letter variants for properties with single letter abbreviations.
  For example, `\p{Letter}` can be equivalently written as `\pL`.

The following is a list of all properties supported by the regex crate (starred
properties correspond to properties required by RL1.2):

* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
* `Script` \*
* `Script_Extensions` \*
* `Age`
* `ASCII_Hex_Digit`
* `Alphabetic` \*
* `Bidi_Control`
* `Case_Ignorable`
* `Cased`
* `Changes_When_Casefolded`
* `Changes_When_Casemapped`
* `Changes_When_Lowercased`
* `Changes_When_Titlecased`
* `Changes_When_Uppercased`
* `Dash`
* `Default_Ignorable_Code_Point` \*
* `Deprecated`
* `Diacritic`
* `Emoji`
* `Emoji_Presentation`
* `Emoji_Modifier`
* `Emoji_Modifier_Base`
* `Emoji_Component`
* `Extended_Pictographic`
* `Extender`
* `Grapheme_Base`
* `Grapheme_Cluster_Break`
* `Grapheme_Extend`
* `Hex_Digit`
* `IDS_Binary_Operator`
* `IDS_Trinary_Operator`
* `ID_Continue`
* `ID_Start`
* `Join_Control`
* `Logical_Order_Exception`
* `Lowercase` \*
* `Math`
* `Noncharacter_Code_Point` \*
* `Pattern_Syntax`
* `Pattern_White_Space`
* `Prepended_Concatenation_Mark`
* `Quotation_Mark`
* `Radical`
* `Regional_Indicator`
* `Sentence_Break`
* `Sentence_Terminal`
* `Soft_Dotted`
* `Terminal_Punctuation`
* `Unified_Ideograph`
* `Uppercase` \*
* `Variation_Selector`
* `White_Space` \*
* `Word_Break`
* `XID_Continue`
* `XID_Start`


## RL1.2a Compatibility Properties

[UTS#18 RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)

The regex crate only provides ASCII definitions of the
[compatibility properties documented in UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties)
(sans the `\X` class, for matching grapheme clusters, which isn't provided
at all). This choice is consistent with most other regular expression engines,
and in particular with the fact that these classes are often referred to as
"ASCII" or "POSIX" character classes.

Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
Their traditional ASCII definition can be used by disabling Unicode. That is,
`[[:word:]]` and `(?-u)\w` are equivalent.


## RL1.3 Subtraction and Intersection

[UTS#18 RL1.3](http://unicode.org/reports/tr18/#Subtraction_and_Intersection)

The regex crate provides full support for nested character classes, along with
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
operations on arbitrary character classes.

For example, to match all non-ASCII letters, you could use either
`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
(intersecting the negation).


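The equivalence between the difference form and the negated-intersection form can be checked on a toy model. The sketch below uses only the standard library and represents a character class as a `BTreeSet<char>`; the tiny sample "universe" and the helper names are assumptions made for illustration and say nothing about how the regex crate represents classes internally.

```rust
use std::collections::BTreeSet;

type Class = BTreeSet<char>;

/// Build a toy class from the characters of a string.
fn class(members: &str) -> Class {
    members.chars().collect()
}

/// `[A--B]`: members of `a` that are not in `b`.
fn difference(a: &Class, b: &Class) -> Class {
    a.difference(b).copied().collect()
}

/// `[A&&B]`: members common to both classes.
fn intersection(a: &Class, b: &Class) -> Class {
    a.intersection(b).copied().collect()
}

/// `[A~~B]`: members of exactly one of the two classes.
fn symmetric_difference(a: &Class, b: &Class) -> Class {
    a.symmetric_difference(b).copied().collect()
}

/// `[^A]`, relative to an explicit toy universe of code points.
fn negate(universe: &Class, a: &Class) -> Class {
    difference(universe, a)
}

fn main() {
    // Toy stand-ins for "all code points", `\p{Letter}` and `\p{Ascii}`.
    let universe = class("ab1 δж");
    let letters = class("abδж");
    let ascii = class("ab1 ");

    let by_difference = difference(&letters, &ascii);
    let by_negated_intersection = intersection(&letters, &negate(&universe, &ascii));

    // Both spellings select the same non-ASCII letters.
    assert_eq!(by_difference, by_negated_intersection);
    assert_eq!(by_difference, class("δж"));
    assert_eq!(symmetric_difference(&letters, &ascii), class("1 δж"));
    println!("{:?}", by_difference);
}
```
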
## RL1.4 Simple Word Boundaries

[UTS#18 RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)

The regex crate provides basic Unicode aware word boundary assertions. A word
boundary assertion can be written as `\b`, or `\B` as its negation. A word
boundary assertion corresponds to a zero-width match, where its adjacent
characters correspond to word and non-word, or non-word and word characters.

Conformance in this case chooses to define word character in the same way that
the `\w` character class is defined: a code point that is a member of one of
the following classes:

* `\p{Alphabetic}`
* `\p{Join_Control}`
* `\p{gc:Mark}`
* `\p{gc:Decimal_Number}`
* `\p{gc:Connector_Punctuation}`

In particular, this differs slightly from the
[prescription given in RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)
but is permissible according to
[UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties).
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
one another.

Finally, Unicode word boundaries can be disabled, which will cause ASCII word
boundaries to be used instead. That is, `\b` is a Unicode word boundary while
`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
if performance is important, since the implementation of Unicode word
boundaries is currently sub-optimal on non-ASCII text.


## RL1.5 Simple Loose Matches

[UTS#18 RL1.5](http://unicode.org/reports/tr18/#Simple_Loose_Matches)

The regex crate provides full support for case insensitive matching in
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
"simple" mapping was chosen because of a key convenient property: every
"simple" mapping is a mapping from exactly one code point to exactly one other
code point. This makes case insensitive matching of character classes, for
example, straightforward to implement.

When case insensitive mode is enabled (e.g., `(?i)[a]` is equivalent to `a|A`),
then all character classes are case folded as well.


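To illustrate why a one-to-one mapping makes class folding easy, here is a rough sketch using only the standard library. Rust's `std` does not expose Unicode simple case folding, so this toy version (the names `toy_simple_fold` and `fold_class` are hypothetical) only approximates it: it adds the lowercase and uppercase counterpart of each member when that counterpart is a single code point, and skips mappings that expand to several code points. It is not the table the regex crate uses and will miss some real folding pairs.

```rust
use std::collections::BTreeSet;

/// Return the mapped character only if the mapping is exactly one code point.
fn single(mut mapped: impl Iterator<Item = char>) -> Option<char> {
    match (mapped.next(), mapped.next()) {
        (Some(c), None) => Some(c),
        _ => None,
    }
}

/// Toy approximation of simple case folding for one code point.
fn toy_simple_fold(c: char) -> BTreeSet<char> {
    let mut set = BTreeSet::new();
    set.insert(c);
    set.extend(single(c.to_lowercase()));
    set.extend(single(c.to_uppercase()));
    set
}

/// Fold a whole class by folding each member independently. This works
/// member by member only because every mapping is one code point to one.
fn fold_class(class: &[char]) -> BTreeSet<char> {
    class.iter().flat_map(|&c| toy_simple_fold(c)).collect()
}

fn main() {
    // `(?i)[a]` behaves like `a|A`.
    assert_eq!(fold_class(&['a']).into_iter().collect::<String>(), "Aa");
    // The full uppercase mapping of ß is two code points ("SS"), which a
    // one-to-one class fold cannot represent, so the toy fold skips it.
    assert_eq!('ß'.to_uppercase().count(), 2);
    assert_eq!(fold_class(&['ß']).into_iter().collect::<String>(), "ß");
    println!("{:?}", fold_class(&['a', 'δ']));
}
```
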
## RL1.6 Line Boundaries

[UTS#18 RL1.6](http://unicode.org/reports/tr18/#Line_Boundaries)

The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
character as a line boundary. This choice was made mostly for implementation
convenience, and to avoid the performance cliffs that Unicode word boundaries
are subject to.

Ideally, it would be nice to at least support `\r\n` as a line boundary as
well, and in theory, this could be done efficiently.


## RL1.7 Code Points

[UTS#18 RL1.7](http://unicode.org/reports/tr18/#Supplementary_Characters)

The regex crate provides full support for Unicode code point matching. Namely,
the fundamental atom of any match is always a single code point.

Given Rust's strong ties to UTF-8, the following guarantees are also provided:

* All matches are reported on valid UTF-8 code unit boundaries. That is, any
  match range returned by the public regex API is guaranteed to successfully
  slice the string that was searched.
* As a consequence of the above, it is impossible to match surrogate code
  points. No support for UTF-16 is provided, so this is never necessary.

Note that when Unicode mode is disabled, the fundamental atom of matching is
no longer a code point but a single byte. When Unicode mode is disabled, many
Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
regex, but `\pL(?-u)\xFF` (which matches any Unicode `Letter` followed by the
literal byte `\xFF`) is.
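
The two UTF-8 guarantees listed above rest on properties of Rust's own `str` and `char` types, which the following standard-library sketch demonstrates. The helper `slice_on_boundaries` is hypothetical; it stands in for slicing a haystack with a reported match range and does not call the regex crate.

```rust
/// Slice `haystack` by a byte range, returning `None` unless both ends fall
/// on UTF-8 code point boundaries. A match range that upholds the guarantee
/// above would always produce `Some`.
fn slice_on_boundaries(haystack: &str, start: usize, end: usize) -> Option<&str> {
    haystack.get(start..end)
}

fn main() {
    // δ is U+03B4 and occupies bytes 1..3 of this haystack.
    let haystack = "aδb";

    assert!(haystack.is_char_boundary(1));
    assert!(!haystack.is_char_boundary(2));

    // A range on code point boundaries slices successfully.
    assert_eq!(slice_on_boundaries(haystack, 1, 3), Some("δ"));
    // A range that splits δ in the middle does not.
    assert_eq!(slice_on_boundaries(haystack, 1, 2), None);

    // Surrogate code points are not valid `char` values, so they can never
    // appear in a `&str` haystack in the first place.
    assert_eq!(char::from_u32(0xD800), None);
    println!("{:?}", slice_on_boundaries(haystack, 1, 3));
}
```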