[/
  Copyright 2006-2007 John Maddock.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]


[section:basic_syntax POSIX Basic Regular Expression Syntax]

[h3 Synopsis]

The POSIX-Basic regular expression syntax is used by the Unix utility `sed`,
and variations are used by `grep` and `emacs`.  You can construct POSIX
basic regular expressions in Boost.Regex by passing the flag `basic` to the
regex constructor (see [syntax_option_type]), for example:

   // e1 is a case sensitive POSIX-Basic expression:
   boost::regex e1(my_expression, boost::regex::basic);
   // e2 is a case insensitive POSIX-Basic expression:
   boost::regex e2(my_expression, boost::regex::basic|boost::regex::icase);

[#boost_regex.posix_basic][h3 POSIX Basic Syntax]

In POSIX-Basic regular expressions, all characters match themselves except
for the following special characters:

[pre .\[\\*^$]

[h4 Wildcard:]

The single character '.' when used outside of a character set will match any
single character except:

* The null character when the flag `match_not_dot_null` is passed to the
matching algorithms.
* The newline character when the flag `match_not_dot_newline` is passed to
the matching algorithms.

[h4 Anchors:]

A '^' character shall match the start of a line when used as the first
character of an expression, or the first character of a sub-expression.

A '$' character shall match the end of a line when used as the last
character of an expression, or the last character of a sub-expression.

[h4 Marked sub-expressions:]

A section beginning `\(` and ending `\)` acts as a marked sub-expression.
Whatever matched the sub-expression is split out in a separate field by the
matching algorithms.  Marked sub-expressions can also be repeated, or
referred to by a back-reference.

[h4 Repeats:]

Any atom (a single character, a marked sub-expression, or a character class)
can be repeated with the \* operator.

For example `a*` will match the letter 'a' repeated zero or more
times (an atom repeated zero times matches an empty string), so the
expression `a*b` will match any of the following:

[pre
b
ab
aaaaaaaab
]

An atom can also be repeated with a bounded repeat:

`a\{n\}`  Matches 'a' repeated exactly n times.

`a\{n,\}`  Matches 'a' repeated n or more times.

`a\{n,m\}`  Matches 'a' repeated between n and m times inclusive.

For example:

[pre ^a\\{2,3\\}$]

Will match either of:

[pre
aa
aaa
]

But neither of:

[pre
a
aaaa
]

It is an error to use a repeat operator if the preceding construct cannot be
repeated, for example:

[pre a\\(\*\\)]

will raise an error, as there is nothing for the \* operator to be applied to.

[h4 Back references:]

An escape character followed by a digit /n/, where /n/ is in the range 1-9,
matches the same string that was matched by sub-expression /n/.  For example
the expression:

[pre ^\\(a\*\\)\[^a\]\*\\1$]

Will match the string:

[pre aaabbaaa]

But not the string:

[pre aaabba]

[h4 Character sets:]

A character set is a bracket-expression starting with \[ and ending with \];
it defines a set of characters, and matches any single character that is a
member of that set.

A bracket expression may contain any combination of the following:

[h5 Single characters:]

For example `[abc]` will match any of the characters 'a', 'b', or 'c'.

[h5 Character ranges:]

For example `[a-c]` will match any single character in the range 'a' to 'c'.
By default, for POSIX-Basic regular expressions, a character /x/ is within the
range /y/ to /z/, if it collates within that range; this results in
locale specific behavior.  This behavior can be turned off by unsetting
the `collate` option flag when constructing the regular expression,
in which case whether a character appears within
a range is determined by comparing the code points of the characters only.

[h5 Negation:]

If the bracket-expression begins with the ^ character, then it matches the
complement of the characters it contains, for example `[^a-c]` matches
any character that is not in the range a-c.

[h5 Character classes:]

An expression of the form `[[:name:]]` matches the named character class "name",
for example `[[:lower:]]` matches any lower case character.
See [link boost_regex.syntax.character_classes character class names].

[h5 Collating Elements:]

An expression of the form `[[.col.]]` matches the collating element /col/.
A collating element is any single character, or any sequence of
characters that collates as a single unit.  Collating elements may also
be used as the end point of a range, for example: `[[.ae.]-c]` matches
the character sequence "ae", plus any single character in the range "ae"-c,
assuming that "ae" is treated as a single collating element in the current locale.

Collating elements may be used in place of escapes (which are not
normally allowed inside character sets), for example `[[.^.]abc]` would
match any one of the characters 'abc^'.

As an extension, a collating element may also be specified via its
symbolic name, for example:

[pre \[\[\.NUL\.\]\]]

matches a 'NUL' character.
See [link boost_regex.syntax.collating_names collating element names].

[h5 Equivalence classes:]

An expression of the form `[[=col=]]` matches any character or collating
element whose primary sort key is the same as that for collating element
/col/; as with collating elements, the name /col/ may be a
[link boost_regex.syntax.collating_names collating symbolic name].
A primary sort key is one that ignores case, accentuation, or
locale-specific tailorings; so for example `[[=a=]]` matches any of
the characters: a, '''&#xC0;''', '''&#xC1;''', '''&#xC2;''',
'''&#xC3;''', '''&#xC4;''', '''&#xC5;''', A, '''&#xE0;''', '''&#xE1;''',
'''&#xE2;''', '''&#xE3;''', '''&#xE4;''' and '''&#xE5;'''.
Unfortunately implementation of this is reliant on the platform's
collation and localisation support; this feature can not be relied
upon to work portably across all platforms, or even all locales on one platform.

[h5 Combinations:]

All of the above can be combined in one character set declaration, for
example: `[[:digit:]a-c[.NUL.]]`.

[h4 Escapes]

With the exception of the escape sequences \\{, \\}, \\(, and \\),
which are documented above, an escape followed by any character matches
that character.  This can be used to make the special characters

[pre .\[\\\*^$]

"ordinary".  Note that the escape character loses its special meaning
inside a character set, so `[\^]` will match either a literal '\\' or a '^'.

[h3 What Gets Matched]

When there is more than one way to match a regular expression, the
"best" possible match is obtained using the
[link boost_regex.syntax.leftmost_longest_rule leftmost-longest rule].

[h3 Variations]

[#boost_regex.grep_syntax][h4 Grep]

When an expression is compiled with the flag `grep` set, then the
expression is treated as a newline separated list of
[link boost_regex.posix_basic POSIX-Basic expressions];
a match is found if any of the expressions in the list match, for example:

   boost::regex e("abc\ndef", boost::regex::grep);

will match either of the [link boost_regex.posix_basic POSIX-Basic expressions]
"abc" or "def".

As its name suggests, this behavior is consistent with the Unix utility grep.

[h4 emacs]

When an expression is compiled with the flag `emacs` set, then
in addition to the [link boost_regex.posix_basic POSIX-Basic features]
the following characters are also special:

[table
[[Character][Description]]
[[+][repeats the preceding atom one or more times.]]
[[?][repeats the preceding atom zero or one times.]]
[[*?][A non-greedy version of *.]]
[[+?][A non-greedy version of +.]]
[[??][A non-greedy version of ?.]]
]

And the following escape sequences are also recognised:

[table
[[Escape][Description]]
[[\\|][specifies an alternative.]]
[[\\(?:  ...  \\)][is a non-marking grouping construct - allows you to lexically group something without generating an extra sub-expression.]]
[[\\w][matches any word character.]]
[[\\W][matches any non-word character.]]
[[\\sx][matches any character in the syntax group x, the following
   emacs groupings are supported: 's', ' ', '_', 'w', '.', ')', '(', '"', '\\'', '>' and '<'.  Refer to the emacs docs for details.]]
[[\\Sx][matches any character not in the syntax grouping x.]]
[[\\c and \\C][These are not supported.]]
[[\\`][matches zero characters only at the start of a buffer (or string being matched).]]
[[\\'][matches zero characters only at the end of a buffer (or string being matched).]]
[[\\b][matches zero characters at a word boundary.]]
[[\\B][matches zero characters, not at a word boundary.]]
[[\\<][matches zero characters only at the start of a word.]]
[[\\>][matches zero characters only at the end of a word.]]
]

Finally, you should note that emacs style regular expressions are matched
according to the
[link boost_regex.syntax.perl_syntax.what_gets_matched Perl "depth first search" rules].
Emacs expressions are
matched this way because they contain Perl-like extensions that do not
interact well with the
[link boost_regex.syntax.leftmost_longest_rule POSIX-style leftmost-longest rule].

[h3 Options]

There are a [link boost_regex.ref.syntax_option_type.syntax_option_type_basic variety of flags] that may be combined with the `basic` and `grep`
options when constructing the regular expression, in particular note
that the
[link boost_regex.ref.syntax_option_type.syntax_option_type_basic `newline_alt`, `no_char_classes`, `no_intervals`, `bk_plus_qm`
and `bk_vbar`] options all alter the syntax, while the
[link boost_regex.ref.syntax_option_type.syntax_option_type_basic `collate` and `icase` options] modify how the case and locale sensitivity
are to be applied.

[h3 References]

[@http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Base Definitions and Headers, Section 9, Regular Expressions (FWD.1).]

[@http://www.opengroup.org/onlinepubs/000095399/utilities/grep.html IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Shells and Utilities, Section 4, Utilities, grep (FWD.1).]

[@http://www.gnu.org/software/emacs/ Emacs Version 21.3.]

[endsect]
