# Unicode conformance

This document describes the regex crate's conformance to Unicode's
[UTS#18](http://unicode.org/reports/tr18/)
report, which lays out 3 levels of support: Basic, Extended and Tailored.

Full support for Level 1 ("Basic Unicode Support") is provided with two
exceptions:

1. Line boundaries are not Unicode aware. Namely, only the `\n`
   (`END OF LINE`) character is recognized as a line boundary.
2. The compatibility properties specified by
   [RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)
   are ASCII-only definitions.

Little to no support is provided for either Level 2 or Level 3. For the most
part, this is because the features are either complex/hard to implement, or at
the very least, very difficult to implement without sacrificing performance.
For example, tackling canonical equivalence such that matching worked as one
would expect regardless of normalization form would be a significant
undertaking. This is at least partially a result of the fact that this regex
engine is based on finite automata, which admit less of the flexibility
normally associated with backtracking implementations.


## RL1.1 Hex Notation

[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)

Hex Notation refers to the ability to specify a Unicode code point in a regular
expression via its hexadecimal code point representation. This is useful in
environments that have poor Unicode font rendering or if you need to express a
code point that is not normally displayable. All forms of hexadecimal notation
are supported:

    \x7F        hex character code (exactly two digits)
    \x{10FFFF}  any hex character code corresponding to a Unicode code point
    \u007F      hex character code (exactly four digits)
    \u{7F}      any hex character code corresponding to a Unicode code point
    \U0000007F  hex character code (exactly eight digits)
    \U{7F}      any hex character code corresponding to a Unicode code point

Briefly, the `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent ways
of expressing hexadecimal code points. Any number of digits can be written
within the brackets. In contrast, `\xNN`, `\uNNNN`, `\UNNNNNNNN` are all
fixed-width variants of the same idea.

Note that when Unicode mode is disabled, any non-ASCII Unicode code point is
banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
mode is disabled. That is, the regex `\xFF` matches the Unicode code point
U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
the literal byte `\xFF`.


## RL1.2 Properties

[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)

Full support for Unicode property syntax is provided. Unicode properties
provide a convenient way to construct character classes of groups of code
points specified by Unicode. The regex crate does not provide exhaustive
support, but covers a useful subset. In particular:

* [General categories](http://unicode.org/reports/tr18/#General_Category_Property)
* [Scripts and Script Extensions](http://unicode.org/reports/tr18/#Script_Property)
* [Age](http://unicode.org/reports/tr18/#Age)
* A smattering of boolean properties, including all of those specified by
  [RL1.2](http://unicode.org/reports/tr18/#RL1.2) explicitly.

In all cases, property name and value abbreviations are supported, and all
names/values are matched loosely without regard for case, whitespace or
underscores. Property name aliases can be found in Unicode's
[`PropertyAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
file, while property value aliases can be found in Unicode's
[`PropertyValueAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
file.

The syntax supported is also consistent with the UTS#18 recommendation:

* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
  `\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
  `\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and
  `Script_Extensions` (or `scx` for short).
* `\p{age:3.2}` selects all code points in Unicode 3.2.
* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
  via `\p{alpha}` (for example).
* Single letter variants for properties with single letter abbreviations.
  For example, `\p{Letter}` can be equivalently written as `\pL`.

The following is a list of all properties supported by the regex crate (starred
properties correspond to properties required by RL1.2):

* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
* `Script` \*
* `Script_Extensions` \*
* `Age`
* `ASCII_Hex_Digit`
* `Alphabetic` \*
* `Bidi_Control`
* `Case_Ignorable`
* `Cased`
* `Changes_When_Casefolded`
* `Changes_When_Casemapped`
* `Changes_When_Lowercased`
* `Changes_When_Titlecased`
* `Changes_When_Uppercased`
* `Dash`
* `Default_Ignorable_Code_Point` \*
* `Deprecated`
* `Diacritic`
* `Emoji`
* `Emoji_Presentation`
* `Emoji_Modifier`
* `Emoji_Modifier_Base`
* `Emoji_Component`
* `Extended_Pictographic`
* `Extender`
* `Grapheme_Base`
* `Grapheme_Cluster_Break`
* `Grapheme_Extend`
* `Hex_Digit`
* `IDS_Binary_Operator`
* `IDS_Trinary_Operator`
* `ID_Continue`
* `ID_Start`
* `Join_Control`
* `Logical_Order_Exception`
* `Lowercase` \*
* `Math`
* `Noncharacter_Code_Point` \*
* `Pattern_Syntax`
* `Pattern_White_Space`
* `Prepended_Concatenation_Mark`
* `Quotation_Mark`
* `Radical`
* `Regional_Indicator`
* `Sentence_Break`
* `Sentence_Terminal`
* `Soft_Dotted`
* `Terminal_Punctuation`
* `Unified_Ideograph`
* `Uppercase` \*
* `Variation_Selector`
* `White_Space` \*
* `Word_Break`
* `XID_Continue`
* `XID_Start`


## RL1.2a Compatibility Properties

[UTS#18 RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)

The regex crate only provides ASCII definitions of the
[compatibility properties documented in UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties)
(sans the `\X` class, for matching grapheme clusters, which isn't provided
at all). This is because it seems to be consistent with most other regular
expression engines, and in particular, because these are often referred to as
"ASCII" or "POSIX" character classes.

Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
Their traditional ASCII definition can be used by disabling Unicode. That is,
`[[:word:]]` and `(?-u)\w` are equivalent.


## RL1.3 Subtraction and Intersection

[UTS#18 RL1.3](http://unicode.org/reports/tr18/#Subtraction_and_Intersection)

The regex crate provides full support for nested character classes, along with
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
operations on arbitrary character classes.

For example, to match all non-ASCII letters, you could use either
`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
(intersecting the negation).


## RL1.4 Simple Word Boundaries

[UTS#18 RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)

The regex crate provides basic Unicode aware word boundary assertions. A word
boundary assertion can be written as `\b`, or `\B` as its negation. A word
boundary assertion corresponds to a zero-width match, where its adjacent
characters correspond to word and non-word, or non-word and word characters.

Conformance in this case chooses to define a word character in the same way
that the `\w` character class is defined: a code point that is a member of one
of the following classes:

* `\p{Alphabetic}`
* `\p{Join_Control}`
* `\p{gc:Mark}`
* `\p{gc:Decimal_Number}`
* `\p{gc:Connector_Punctuation}`

In particular, this differs slightly from the
[prescription given in RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)
but is permissible according to
[UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties).
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
one another.

Finally, Unicode word boundaries can be disabled, which will cause ASCII word
boundaries to be used instead. That is, `\b` is a Unicode word boundary while
`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
if performance is important, since the implementation of Unicode word
boundaries is currently sub-optimal on non-ASCII text.


## RL1.5 Simple Loose Matches

[UTS#18 RL1.5](http://unicode.org/reports/tr18/#Simple_Loose_Matches)

The regex crate provides full support for case insensitive matching in
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
"simple" mapping was chosen because of a key convenient property: every
"simple" mapping is a mapping from exactly one code point to exactly one other
code point. This makes case insensitive matching of character classes, for
example, straightforward to implement.

When case insensitive mode is enabled, all character classes are case folded
as well (e.g., `(?i)[a]` is equivalent to `a|A`).
225
226
227## RL1.6 Line Boundaries
228
17df50a5 229[UTS#18 RL1.6](http://unicode.org/reports/tr18/#Line_Boundaries)
94b46f34
XL
230
231The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
232character as a line boundary. This choice was made mostly for implementation
233convenience, and to avoid performance cliffs that Unicode word boundaries are
234subject to.
235
236Ideally, it would be nice to at least support `\r\n` as a line boundary as
237well, and in theory, this could be done efficiently.


## RL1.7 Code Points

[UTS#18 RL1.7](http://unicode.org/reports/tr18/#Supplementary_Characters)

The regex crate provides full support for Unicode code point matching. Namely,
the fundamental atom of any match is always a single code point.

Given Rust's strong ties to UTF-8, the following guarantees are also provided:

* All matches are reported on valid UTF-8 code unit boundaries. That is, any
  match range returned by the public regex API is guaranteed to successfully
  slice the string that was searched.
* As a consequence of the above, it is impossible to match surrogate code
  points. No support for UTF-16 is provided, so this is never necessary.

Note that when Unicode mode is disabled, the fundamental atom of matching is
no longer a code point but a single byte. When Unicode mode is disabled, many
Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
regex but `\pL(?-u)\xFF` (matches any Unicode `Letter` followed by the literal
byte `\xFF`) is.