# Unicode conformance

This document describes the regex crate's conformance to Unicode's
[UTS#18](http://unicode.org/reports/tr18/)
report, which lays out 3 levels of support: Basic, Extended and Tailored.

Full support for Level 1 ("Basic Unicode Support") is provided with two
exceptions:

1. Line boundaries are not Unicode aware. Namely, only the `\n`
   (`END OF LINE`) character is recognized as a line boundary.
2. The compatibility properties specified by
   [RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)
   are ASCII-only definitions.

Little to no support is provided for either Level 2 or Level 3. For the most
part, this is because the features are either complex/hard to implement, or at
the very least, very difficult to implement without sacrificing performance.
For example, tackling canonical equivalence such that matching worked as one
would expect regardless of normalization form would be a significant
undertaking. This is at least partially a result of the fact that this regex
engine is based on finite automata, which offer less of the flexibility
normally associated with backtracking implementations.


## RL1.1 Hex Notation

[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)

Hex Notation refers to the ability to specify a Unicode code point in a regular
expression via its hexadecimal code point representation. This is useful in
environments that have poor Unicode font rendering or if you need to express a
code point that is not normally displayable. All forms of hexadecimal notation
are supported:

    \x7F        hex character code (exactly two digits)
    \x{10FFFF}  any hex character code corresponding to a Unicode code point
    \u007F      hex character code (exactly four digits)
    \u{7F}      any hex character code corresponding to a Unicode code point
    \U0000007F  hex character code (exactly eight digits)
    \U{7F}      any hex character code corresponding to a Unicode code point

Briefly, the `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent ways
of expressing hexadecimal code points. Any number of digits can be written
within the brackets. In contrast, `\xNN`, `\uNNNN`, `\UNNNNNNNN` are all
fixed-width variants of the same idea.
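
The equivalence between the braced and fixed-width forms comes down to parsing
the same hexadecimal number. The following is a minimal sketch using only the
standard library; the helper `code_point_from_hex` is hypothetical and is not
part of the regex crate's API.

```rust
/// Hypothetical helper: turns the hex digits of an escape such as `\x{7F}`
/// or `\u007F` into a `char`. Returns `None` if parsing fails or if the
/// value does not name a Unicode scalar value.
fn code_point_from_hex(digits: &str) -> Option<char> {
    let value = u32::from_str_radix(digits, 16).ok()?;
    // `char::from_u32` rejects surrogates and values above U+10FFFF.
    char::from_u32(value)
}
```

Under this sketch, the digit strings `7F`, `007F` and `0000007F` all yield
U+007F, which mirrors how `\x7F`, `\u007F` and `\U0000007F` name the same code
point.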

Note that when Unicode mode is disabled, any non-ASCII Unicode code point is
banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
mode is disabled. That is, the regex `\xFF` matches the Unicode code point
U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
the literal byte `\xFF`.
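
The difference between the code point U+00FF and the byte `\xFF` is visible at
the byte level. Here is a small sketch using only the standard library; the
helper names are illustrative and are not regex crate APIs.

```rust
/// Returns the UTF-8 encoding of a single code point as a byte vector.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Reports whether a byte sequence is valid UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

With these helpers, `utf8_bytes('\u{FF}')` is the two-byte sequence
`[0xC3, 0xBF]`, while `is_valid_utf8(&[0xFF])` is `false`: the lone byte `0xFF`
never occurs in valid UTF-8.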


## RL1.2 Properties

[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)

Full support for Unicode property syntax is provided. Unicode properties
provide a convenient way to construct character classes of groups of code
points specified by Unicode. The regex crate does not provide exhaustive
support, but covers a useful subset. In particular:

* [General categories](http://unicode.org/reports/tr18/#General_Category_Property)
* [Scripts and Script Extensions](http://unicode.org/reports/tr18/#Script_Property)
* [Age](http://unicode.org/reports/tr18/#Age)
* A smattering of boolean properties, including all of those specified by
  [RL1.2](http://unicode.org/reports/tr18/#RL1.2) explicitly.

In all cases, property name and value abbreviations are supported, and all
names/values are matched loosely without regard for case, whitespace or
underscores. Property name aliases can be found in Unicode's
[`PropertyAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
file, while property value aliases can be found in Unicode's
[`PropertyValueAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
file.
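
Loose matching of names can be pictured as comparing normalized forms. The
sketch below assumes only the three rules stated above (case, whitespace and
underscores are ignored); the function names are hypothetical, and the crate's
actual normalization may differ in detail.

```rust
/// Hypothetical normalizer: lowercases a property name and drops
/// whitespace and underscores, so that loosely equal names compare equal.
fn normalize_property_name(name: &str) -> String {
    name.chars()
        .filter(|c| !c.is_whitespace() && *c != '_')
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Compares two property names or values loosely.
fn loosely_equal(a: &str, b: &str) -> bool {
    normalize_property_name(a) == normalize_property_name(b)
}
```

For instance, `loosely_equal("Script_Extensions", "scriptextensions")` holds,
while `Script` and `Script_Extensions` remain distinct.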

The syntax supported is also consistent with the UTS#18 recommendation:

* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
  `\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
  `\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and
  `Script_Extensions` (or `scx` for short).
* `\p{age:3.2}` selects all code points in Unicode 3.2.
* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
  via `\p{alpha}` (for example).
* Single letter variants for properties with single letter abbreviations.
  For example, `\p{Letter}` can be equivalently written as `\pL`.

The following is a list of all properties supported by the regex crate (starred
properties correspond to properties required by RL1.2):

* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
* `Script` \*
* `Script_Extensions` \*
* `Age`
* `ASCII_Hex_Digit`
* `Alphabetic` \*
* `Bidi_Control`
* `Case_Ignorable`
* `Cased`
* `Changes_When_Casefolded`
* `Changes_When_Casemapped`
* `Changes_When_Lowercased`
* `Changes_When_Titlecased`
* `Changes_When_Uppercased`
* `Dash`
* `Default_Ignorable_Code_Point` \*
* `Deprecated`
* `Diacritic`
* `Emoji`
* `Emoji_Presentation`
* `Emoji_Modifier`
* `Emoji_Modifier_Base`
* `Emoji_Component`
* `Extended_Pictographic`
* `Extender`
* `Grapheme_Base`
* `Grapheme_Cluster_Break`
* `Grapheme_Extend`
* `Hex_Digit`
* `IDS_Binary_Operator`
* `IDS_Trinary_Operator`
* `ID_Continue`
* `ID_Start`
* `Join_Control`
* `Logical_Order_Exception`
* `Lowercase` \*
* `Math`
* `Noncharacter_Code_Point` \*
* `Pattern_Syntax`
* `Pattern_White_Space`
* `Prepended_Concatenation_Mark`
* `Quotation_Mark`
* `Radical`
* `Regional_Indicator`
* `Sentence_Break`
* `Sentence_Terminal`
* `Soft_Dotted`
* `Terminal_Punctuation`
* `Unified_Ideograph`
* `Uppercase` \*
* `Variation_Selector`
* `White_Space` \*
* `Word_Break`
* `XID_Continue`
* `XID_Start`


## RL1.2a Compatibility Properties

[UTS#18 RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)

The regex crate only provides ASCII definitions of the
[compatibility properties documented in UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties)
(sans the `\X` class, for matching grapheme clusters, which isn't provided
at all). This is because it seems to be consistent with most other regular
expression engines, and in particular, because these are often referred to as
"ASCII" or "POSIX" character classes.

Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
Their traditional ASCII definition can be used by disabling Unicode. That is,
`[[:word:]]` and `(?-u)\w` are equivalent.


## RL1.3 Subtraction and Intersection

[UTS#18 RL1.3](http://unicode.org/reports/tr18/#Subtraction_and_Intersection)

The regex crate provides full support for nested character classes, along with
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
operations on arbitrary character classes.

For example, to match all non-ASCII letters, you could use either
`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
(intersecting the negation).
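
The two spellings agree because `A--B` and `A&&[^B]` are the same set. The
following sketch checks that identity over a small sample of characters using
only the standard library. Note the assumptions: the functions are
illustrative, the complement is taken relative to the sample rather than all
of Unicode, and `char::is_alphabetic` (the `Alphabetic` property) is only a
stand-in for `\p{Letter}`.

```rust
use std::collections::BTreeSet;

/// Collects the chars of `universe` that satisfy `pred` into a set.
fn class(universe: &str, pred: impl Fn(char) -> bool) -> BTreeSet<char> {
    universe.chars().filter(|&c| pred(c)).collect()
}

/// `[A--B]`: members of `a` that are not in `b`.
fn difference(a: &BTreeSet<char>, b: &BTreeSet<char>) -> BTreeSet<char> {
    a.difference(b).copied().collect()
}

/// `[A&&[^B]]`: members of `a` that are also in the complement of `b`,
/// where the complement is taken relative to the chars of `universe`.
fn intersect_negated(
    universe: &str,
    a: &BTreeSet<char>,
    b: &BTreeSet<char>,
) -> BTreeSet<char> {
    let not_b: BTreeSet<char> =
        universe.chars().filter(|c| !b.contains(c)).collect();
    a.intersection(&not_b).copied().collect()
}
```

Taking the sample `"aZ9 \u{E9}\u{3BB}_"`, both operations applied to the
alphabetic chars and the ASCII chars leave exactly U+00E9 and U+03BB.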


## RL1.4 Simple Word Boundaries

[UTS#18 RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)

The regex crate provides basic Unicode aware word boundary assertions. A word
boundary assertion can be written as `\b`, or `\B` as its negation. A word
boundary assertion corresponds to a zero-width match, where its adjacent
characters correspond to word and non-word, or non-word and word characters.

Conformance in this case chooses to define word character in the same way that
the `\w` character class is defined: a code point that is a member of one of
the following classes:

* `\p{Alphabetic}`
* `\p{Join_Control}`
* `\p{gc:Mark}`
* `\p{gc:Decimal_Number}`
* `\p{gc:Connector_Punctuation}`

In particular, this differs slightly from the
[prescription given in RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)
but is permissible according to
[UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties).
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
one another.

Finally, Unicode word boundaries can be disabled, which will cause ASCII word
boundaries to be used instead. That is, `\b` is a Unicode word boundary while
`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
if performance is important, since the implementation of Unicode word
boundaries is currently sub-optimal on non-ASCII text.
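
To make the ASCII-only rule concrete, here is a minimal sketch of an ASCII word
boundary check over bytes, using only the standard library. The function names
are hypothetical and this is not how the regex crate implements `(?-u)\b`; it
only models the definition (exactly one side of the position is a word
character, with the ends of the text counting as non-word).

```rust
/// ASCII definition of a word character: `[0-9A-Za-z_]`.
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Reports whether byte offset `at` in `text` is an ASCII word boundary:
/// exactly one of the bytes on either side of `at` is a word byte.
/// Positions before the start or past the end count as non-word.
fn is_ascii_word_boundary(text: &[u8], at: usize) -> bool {
    let before = at > 0
        && text.get(at - 1).map_or(false, |&b| is_ascii_word_byte(b));
    let after = text.get(at).map_or(false, |&b| is_ascii_word_byte(b));
    before != after
}
```

On the bytes of `hi there`, this reports boundaries at offsets 0, 2, 3 and 8.
On the UTF-8 bytes of U+00E9 it reports none, since neither byte is an ASCII
word character.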


## RL1.5 Simple Loose Matches

[UTS#18 RL1.5](http://unicode.org/reports/tr18/#Simple_Loose_Matches)

The regex crate provides full support for case insensitive matching in
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
"simple" mapping was chosen because of a key convenient property: every
"simple" mapping is a mapping from exactly one code point to exactly one other
code point. This makes case insensitive matching of character classes, for
example, straightforward to implement.

When case insensitive mode is enabled, all character classes are case folded
as well (e.g., `(?i)[a]` is equivalent to `a|A`).
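
Because each simple mapping is one code point to one code point, folding a
class amounts to adding the counterpart of every member. The sketch below is
restricted to ASCII letters so that it needs only the standard library, which
does not expose Unicode simple case folding. The function `fold_class_ascii`
is hypothetical and is not the crate's implementation.

```rust
use std::collections::BTreeSet;

/// ASCII-only sketch of case folding a character class: for every ASCII
/// letter in the class, its other-case counterpart is added as well.
/// Non-letters and non-ASCII members are kept unchanged.
fn fold_class_ascii(class: &BTreeSet<char>) -> BTreeSet<char> {
    let mut folded = class.clone();
    for &c in class {
        folded.insert(c.to_ascii_lowercase());
        folded.insert(c.to_ascii_uppercase());
    }
    folded
}
```

Folding the class containing only `a` yields the class containing `A` and `a`,
matching the `(?i)[a]` example.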


## RL1.6 Line Boundaries

[UTS#18 RL1.6](http://unicode.org/reports/tr18/#Line_Boundaries)

The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
character as a line boundary. This choice was made mostly for implementation
convenience, and to avoid performance cliffs that Unicode word boundaries are
subject to.

Ideally, it would be nice to at least support `\r\n` as a line boundary as
well, and in theory, this could be done efficiently.
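
One practical consequence of the `\n`-only rule is that in text with `\r\n`
line endings, the `\r` stays attached to the end of each line. A minimal
sketch of the rule using only the standard library (the function `line_starts`
is hypothetical and only models the rule; it does not call the regex crate):

```rust
/// Byte offsets at which a line begins when only `\n` terminates a line:
/// offset 0 and the offset just after each `\n`.
fn line_starts(text: &str) -> Vec<usize> {
    let mut starts = vec![0];
    starts.extend(text.match_indices('\n').map(|(i, _)| i + 1));
    starts
}
```

For `"a\r\nb\nc"` this gives offsets 0, 3 and 5, and the first line is `"a\r"`
rather than `"a"`. Other Unicode line separators such as U+2028 do not start a
new line under this rule.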


## RL1.7 Code Points

[UTS#18 RL1.7](http://unicode.org/reports/tr18/#Supplementary_Characters)

The regex crate provides full support for Unicode code point matching. Namely,
the fundamental atom of any match is always a single code point.

Given Rust's strong ties to UTF-8, the following guarantees are also provided:

* All matches are reported on valid UTF-8 code unit boundaries. That is, any
  match range returned by the public regex API is guaranteed to successfully
  slice the string that was searched.
* As a consequence of the above, it is impossible to match surrogate code points.
  No support for UTF-16 is provided, so this is never necessary.
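
Both guarantees can be illustrated with the standard library alone. The helper
names below are hypothetical; the sketch only shows what "valid UTF-8
boundary" and "surrogate code point" mean in Rust terms, not how the regex
crate enforces them.

```rust
/// Reports whether `start..end` can be used to slice `text` without
/// panicking, i.e. both offsets lie on UTF-8 code point boundaries.
fn is_sliceable(text: &str, start: usize, end: usize) -> bool {
    start <= end && text.is_char_boundary(start) && text.is_char_boundary(end)
}

/// Reports whether `value` is a surrogate code point (U+D800 to U+DFFF).
fn is_surrogate(value: u32) -> bool {
    (0xD800..=0xDFFF).contains(&value)
}
```

In `"a\u{E9}b"`, the range `1..3` covers the two bytes of U+00E9 and is
sliceable, while `2..3` starts in the middle of that code point and is not.
Surrogates cannot be represented as a Rust `char` at all:
`char::from_u32(0xD800)` is `None`.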

Note that when Unicode mode is disabled, the fundamental atom of matching is
no longer a code point but a single byte. When Unicode mode is disabled, many
Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
regex, but `\pL(?-u)\xFF` (which matches any Unicode `Letter` followed by the
literal byte `\xFF`) is.