# Unicode conformance

This document describes the regex crate's conformance to Unicode's
[UTS#18](https://unicode.org/reports/tr18/)
report, which lays out 3 levels of support: Basic, Extended and Tailored.

Full support for Level 1 ("Basic Unicode Support") is provided with two
exceptions:

1. Line boundaries are not Unicode aware. Namely, only the `\n`
   (`END OF LINE`) character is recognized as a line boundary.
2. The compatibility properties specified by
   [RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)
   are ASCII-only definitions.

Little to no support is provided for either Level 2 or Level 3. For the most
part, this is because the features are either complex to implement or, at the
very least, very difficult to implement without sacrificing performance.
For example, tackling canonical equivalence such that matching worked as one
would expect regardless of normalization form would be a significant
undertaking. This is at least partially a result of the fact that this regex
engine is based on finite automata, which admit less of the flexibility
normally associated with backtracking implementations.


## RL1.1 Hex Notation

[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)

Hex Notation refers to the ability to specify a Unicode code point in a regular
expression via its hexadecimal code point representation. This is useful in
environments that have poor Unicode font rendering or if you need to express a
code point that is not normally displayable. All forms of hexadecimal notation
are supported:

    \x7F        hex character code (exactly two digits)
    \x{10FFFF}  any hex character code corresponding to a Unicode code point
    \u007F      hex character code (exactly four digits)
    \u{7F}      any hex character code corresponding to a Unicode code point
    \U0000007F  hex character code (exactly eight digits)
    \U{7F}      any hex character code corresponding to a Unicode code point

Briefly, the `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent ways
of expressing hexadecimal code points. Any number of digits can be written
within the braces. In contrast, `\xNN`, `\uNNNN`, `\UNNNNNNNN` are all
fixed-width variants of the same idea.

Note that when Unicode mode is disabled, any non-ASCII Unicode code point is
banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
mode is disabled. That is, the regex `\xFF` matches the Unicode code point
U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
the literal byte `\xFF`.


## RL1.2 Properties

[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)

Full support for Unicode property syntax is provided. Unicode properties
provide a convenient way to construct character classes of groups of code
points specified by Unicode. The regex crate does not support every Unicode
property, but it covers a useful subset. In particular:

* [General categories](https://unicode.org/reports/tr18/#General_Category_Property)
* [Scripts and Script Extensions](https://unicode.org/reports/tr18/#Script_Property)
* [Age](https://unicode.org/reports/tr18/#Age)
* A smattering of boolean properties, including all of those specified by
  [RL1.2](https://unicode.org/reports/tr18/#RL1.2) explicitly.

In all cases, property name and value abbreviations are supported, and all
names/values are matched loosely without regard for case, whitespace or
underscores. Property name aliases can be found in Unicode's
[`PropertyAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
file, while property value aliases can be found in Unicode's
[`PropertyValueAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
file.

The syntax supported is also consistent with the UTS#18 recommendation:

* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
  `\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
  `\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and
  `Script_Extensions` (or `scx` for short).
* `\p{age:3.2}` selects all code points assigned as of Unicode 3.2.
* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
  via `\p{alpha}` (for example).
* Single letter variants for properties with single letter abbreviations.
  For example, `\p{Letter}` can be equivalently written as `\pL`.

The following is a list of all properties supported by the regex crate (starred
properties correspond to properties required by RL1.2):

* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
* `Script` \*
* `Script_Extensions` \*
* `Age`
* `ASCII_Hex_Digit`
* `Alphabetic` \*
* `Bidi_Control`
* `Case_Ignorable`
* `Cased`
* `Changes_When_Casefolded`
* `Changes_When_Casemapped`
* `Changes_When_Lowercased`
* `Changes_When_Titlecased`
* `Changes_When_Uppercased`
* `Dash`
* `Default_Ignorable_Code_Point` \*
* `Deprecated`
* `Diacritic`
* `Emoji`
* `Emoji_Presentation`
* `Emoji_Modifier`
* `Emoji_Modifier_Base`
* `Emoji_Component`
* `Extended_Pictographic`
* `Extender`
* `Grapheme_Base`
* `Grapheme_Cluster_Break`
* `Grapheme_Extend`
* `Hex_Digit`
* `IDS_Binary_Operator`
* `IDS_Trinary_Operator`
* `ID_Continue`
* `ID_Start`
* `Join_Control`
* `Logical_Order_Exception`
* `Lowercase` \*
* `Math`
* `Noncharacter_Code_Point` \*
* `Pattern_Syntax`
* `Pattern_White_Space`
* `Prepended_Concatenation_Mark`
* `Quotation_Mark`
* `Radical`
* `Regional_Indicator`
* `Sentence_Break`
* `Sentence_Terminal`
* `Soft_Dotted`
* `Terminal_Punctuation`
* `Unified_Ideograph`
* `Uppercase` \*
* `Variation_Selector`
* `White_Space` \*
* `Word_Break`
* `XID_Continue`
* `XID_Start`


## RL1.2a Compatibility Properties

[UTS#18 RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)

The regex crate only provides ASCII definitions of the
[compatibility properties documented in UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties)
(sans the `\X` class, for matching grapheme clusters, which isn't provided
at all). This is because it seems to be consistent with most other regular
expression engines, and in particular, because these are often referred to as
"ASCII" or "POSIX" character classes.

Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
Their traditional ASCII definition can be used by disabling Unicode. That is,
`[[:word:]]` and `(?-u)\w` are equivalent.


## RL1.3 Subtraction and Intersection

[UTS#18 RL1.3](https://unicode.org/reports/tr18/#Subtraction_and_Intersection)

The regex crate provides full support for nested character classes, along with
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
operations on arbitrary character classes.

For example, to match all non-ASCII letters, you could use either
`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
(intersecting the negation).


## RL1.4 Simple Word Boundaries

[UTS#18 RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)

The regex crate provides basic Unicode aware word boundary assertions. A word
boundary assertion can be written as `\b`, or `\B` as its negation. A word
boundary assertion corresponds to a zero-width match where the adjacent
characters are a word character followed by a non-word character, or a
non-word character followed by a word character.

For conformance, a word character is defined in the same way as the `\w`
character class: a code point that is a member of one of the following
classes:

* `\p{Alphabetic}`
* `\p{Join_Control}`
* `\p{gc:Mark}`
* `\p{gc:Decimal_Number}`
* `\p{gc:Connector_Punctuation}`

In particular, this differs slightly from the
[prescription given in RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)
but is permissible according to
[UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties).
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
one another.

Finally, Unicode word boundaries can be disabled, which will cause ASCII word
boundaries to be used instead. That is, `\b` is a Unicode word boundary while
`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
if performance is important, since the implementation of Unicode word
boundaries is currently sub-optimal on non-ASCII text.


## RL1.5 Simple Loose Matches

[UTS#18 RL1.5](https://unicode.org/reports/tr18/#Simple_Loose_Matches)

The regex crate provides full support for case insensitive matching in
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
"simple" mapping was chosen because of a key convenient property: every
"simple" mapping is a mapping from exactly one code point to exactly one other
code point. This makes case insensitive matching of character classes, for
example, straightforward to implement.

When case insensitive mode is enabled (e.g., `(?i)[a]` is equivalent to `a|A`),
then all character classes are case folded as well.


## RL1.6 Line Boundaries

[UTS#18 RL1.6](https://unicode.org/reports/tr18/#Line_Boundaries)

The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
character as a line boundary. This choice was made mostly for implementation
convenience, and to avoid performance cliffs that Unicode word boundaries are
subject to.

Ideally, it would be nice to at least support `\r\n` as a line boundary as
well, and in theory, this could be done efficiently.


## RL1.7 Code Points

[UTS#18 RL1.7](https://unicode.org/reports/tr18/#Supplementary_Characters)

The regex crate provides full support for Unicode code point matching. Namely,
the fundamental atom of any match is always a single code point.

Given Rust's strong ties to UTF-8, the following guarantees are also provided:

* All matches are reported on valid UTF-8 code unit boundaries. That is, any
  match range returned by the public regex API is guaranteed to successfully
  slice the string that was searched.
* As a consequence of the above, it is impossible to match surrogate code
  points. No support for UTF-16 is provided, so this is never necessary.

Note that when Unicode mode is disabled, the fundamental atom of matching is
no longer a code point but a single byte. When Unicode mode is disabled, many
Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
regex, but `\pL(?-u)\xFF` (which matches any Unicode `Letter` followed by the
literal byte `\xFF`) is.