Commit | Line | Data |
---|---|---|
1 | [/ | |
2 | Copyright 2006-2007 John Maddock. | |
3 | Distributed under the Boost Software License, Version 1.0. | |
4 | (See accompanying file LICENSE_1_0.txt or copy at | |
5 | http://www.boost.org/LICENSE_1_0.txt). | |
6 | ] | |
7 | ||
8 | ||
9 | [section:perl_syntax Perl Regular Expression Syntax] | |
10 | ||
11 | [h3 Synopsis] | |
12 | ||
13 | The Perl regular expression syntax is based on that used by the | |
14 | programming language Perl. Perl regular expressions are the | |
15 | default behavior in Boost.Regex, or you can pass the flag [^perl] explicitly to the | |
16 | [basic_regex] constructor, for example: | |
17 | ||
18 | // e1 is a case sensitive Perl regular expression: | |
19 | // since Perl is the default option there's no need to explicitly specify the syntax used here: | |
20 | boost::regex e1(my_expression); | |
21 | // e2 is a case insensitive Perl regular expression: | |
22 | boost::regex e2(my_expression, boost::regex::perl|boost::regex::icase); | |
23 | ||
24 | [h3 Perl Regular Expression Syntax] | |
25 | ||
26 | In Perl regular expressions, all characters match themselves except for the | |
27 | following special characters: | |
28 | ||
29 | [pre .\[{}()\\\*+?|^$] | |
30 | ||
31 | [h4 Wildcard] | |
32 | ||
33 | The single character '.' when used outside of a character set will match | |
34 | any single character except: | |
35 | ||
36 | * The NULL character when the [link boost_regex.ref.match_flag_type flag | |
37 | [^match_not_dot_null]] is passed to the matching algorithms. | |
38 | * The newline character when the [link boost_regex.ref.match_flag_type | |
39 | flag [^match_not_dot_newline]] is passed to | |
40 | the matching algorithms. | |
41 | ||
42 | [h4 Anchors] | |
43 | ||
44 | A '^' character shall match the start of a line. | |
45 | ||
46 | A '$' character shall match the end of a line. | |
47 | ||
48 | [h4 Marked sub-expressions] | |
49 | ||
50 | A section beginning [^(] and ending [^)] acts as a marked sub-expression. | |
51 | Whatever matched the sub-expression is split out in a separate field by | |
52 | the matching algorithms. Marked sub-expressions can also be repeated, or | |
53 | referred to by a back-reference. | |
54 | ||
55 | [h4 Non-marking grouping] | |
56 | ||
57 | A marked sub-expression is useful to lexically group part of a regular | |
58 | expression, but has the side-effect of spitting out an extra field in | |
59 | the result. As an alternative you can lexically group part of a | |
60 | regular expression, without generating a marked sub-expression by using | |
61 | [^(?:] and [^)], for example [^(?:ab)+] will repeat [^ab] without splitting | |
62 | out any separate sub-expressions. | |
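
The difference between marked and non-marking groups can be observed through the number of sub-expressions a compiled expression reports. The following is a minimal sketch, not part of the library documentation: it assumes the standard library's `std::regex` (ECMAScript grammar) in place of Boost.Regex so that it builds without extra dependencies, and relies only on grouping syntax that both grammars share. The helper names `marked_count`, `capture` and `check_grouping` are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Number of marked sub-expressions that a pattern defines.
unsigned marked_count(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}

// Text captured by marked sub-expression `index`,
// or an empty string if the whole text does not match.
std::string capture(const std::string& pattern, const std::string& text, std::size_t index) {
    std::smatch what;
    if (!std::regex_match(text, what, std::regex(pattern)))
        return std::string();
    return what[index].str();
}

void check_grouping() {
    // (ab) is marked, so it produces one extra field in the result.
    assert(marked_count("(ab)+c") == 1u);
    // (?:ab) groups lexically without producing a field.
    assert(marked_count("(?:ab)+c") == 0u);
    // With a repeated marked group, the field holds the last repetition.
    assert(capture("(ab)+c", "ababc", 1) == "ab");
    // The non-marking group leaves index 1 free for the next marked group.
    assert(capture("(?:ab)+(c)", "ababc", 1) == "c");
}
```

Using a non-marking group keeps sub-expression indexes stable when a group is needed only for repetition or alternation.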
63 | ||
64 | [h4 Repeats] | |
65 | ||
66 | Any atom (a single character, a marked sub-expression, or a character class) | |
67 | can be repeated with the [^*], [^+], [^?], and [^{}] operators. | |
68 | ||
69 | The [^*] operator will match the preceding atom zero or more times, | |
70 | for example the expression [^a*b] will match any of the following: | |
71 | ||
72 | b | |
73 | ab | |
74 | aaaaaaaab | |
75 | ||
76 | The [^+] operator will match the preceding atom one or more times, for | |
77 | example the expression [^a+b] will match any of the following: | |
78 | ||
79 | ab | |
80 | aaaaaaaab | |
81 | ||
82 | But will not match: | |
83 | ||
84 | b | |
85 | ||
86 | The [^?] operator will match the preceding atom zero or one times, for | |
87 | example the expression [^ca?b] will match any of the following: | |
88 | ||
89 | cb | |
90 | cab | |
91 | ||
92 | But will not match: | |
93 | ||
94 | caab | |
95 | ||
96 | An atom can also be repeated with a bounded repeat: | |
97 | ||
98 | [^a{n}] Matches 'a' repeated exactly n times. | |
99 | ||
100 | [^a{n,}] Matches 'a' repeated n or more times. | |
101 | ||
102 | [^a{n,m}] Matches 'a' repeated between n and m times inclusive. | |
103 | ||
104 | For example: | |
105 | ||
106 | [pre ^a{2,3}$] | |
107 | ||
108 | Will match either of: | |
109 | ||
110 | aa | |
111 | aaa | |
112 | ||
113 | But neither of: | |
114 | ||
115 | a | |
116 | aaaa | |
117 | ||
118 | Note that the "{" and "}" characters will be treated as ordinary literals when used | |
119 | in a context that is not a repeat: this matches Perl 5.x behavior. For example in | |
120 | the expressions "ab{1", "ab1}" and "a{b}c" the curly brackets are all treated as | |
121 | literals and ['no error will be raised]. | |
122 | ||
123 | It is an error to use a repeat operator, if the preceding construct can not | |
124 | be repeated, for example: | |
125 | ||
126 | a(*) | |
127 | ||
128 | Will raise an error, as there is nothing for the [^*] operator to be applied to. | |
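
The matching and non-matching strings listed in this section can be checked mechanically. This is a minimal sketch under stated assumptions: it uses the standard library's `std::regex` (ECMAScript grammar) rather than Boost.Regex, which handles these particular repeat operators the same way, and the names `matches_whole` and `check_repeats` are hypothetical. It deliberately leaves out the literal-brace and `a(*)` cases, where the two libraries may differ.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if the whole of `text` is matched by `pattern`.
bool matches_whole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

void check_repeats() {
    // '*' : zero or more of the preceding atom.
    assert(matches_whole("a*b", "b"));
    assert(matches_whole("a*b", "ab"));
    assert(matches_whole("a*b", "aaaaaaaab"));

    // '+' : one or more.
    assert(matches_whole("a+b", "ab"));
    assert(!matches_whole("a+b", "b"));

    // '?' : zero or one.
    assert(matches_whole("ca?b", "cb"));
    assert(matches_whole("ca?b", "cab"));
    assert(!matches_whole("ca?b", "caab"));

    // Bounded repeat: between 2 and 3 times inclusive.
    assert(matches_whole("^a{2,3}$", "aa"));
    assert(matches_whole("^a{2,3}$", "aaa"));
    assert(!matches_whole("^a{2,3}$", "a"));
    assert(!matches_whole("^a{2,3}$", "aaaa"));
}
```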
129 | ||
130 | [h4 Non greedy repeats] | |
131 | ||
132 | The normal repeat operators are "greedy", that is to say they will consume as | |
133 | much input as possible. There are non-greedy versions available that will | |
134 | consume as little input as possible while still producing a match. | |
135 | ||
136 | [^*?] Matches the previous atom zero or more times, while consuming as little | |
137 | input as possible. | |
138 | ||
139 | [^+?] Matches the previous atom one or more times, while consuming as | |
140 | little input as possible. | |
141 | ||
142 | [^??] Matches the previous atom zero or one times, while consuming | |
143 | as little input as possible. | |
144 | ||
145 | [^{n,}?] Matches the previous atom n or more times, while consuming as | |
146 | little input as possible. | |
147 | ||
148 | [^{n,m}?] Matches the previous atom between n and m times, while | |
149 | consuming as little input as possible. | |
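
The effect of a non-greedy repeat is easiest to see when searching rather than matching a whole string. This is a minimal sketch that assumes the standard library's `std::regex` (ECMAScript grammar) instead of Boost.Regex, as the two treat these operators alike. The helper names `first_match` and `check_non_greedy` are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Text of the first match of `pattern` found in `text`, or "" if there is none.
std::string first_match(const std::string& pattern, const std::string& text) {
    std::smatch what;
    if (!std::regex_search(text, what, std::regex(pattern)))
        return std::string();
    return what.str();
}

void check_non_greedy() {
    // Greedy: '.+' consumes as much as possible before the final '>'.
    assert(first_match("<.+>", "<a><b>") == "<a><b>");
    // Non-greedy: '.+?' stops at the first '>' that lets the match succeed.
    assert(first_match("<.+?>", "<a><b>") == "<a>");

    // The same applies to bounded repeats.
    assert(first_match("a{2,}", "aaaa") == "aaaa");
    assert(first_match("a{2,}?", "aaaa") == "aa");
}
```

Both forms find a match in the same places; they differ only in how much input that match consumes.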
150 | ||
151 | [h4 Possessive repeats] | |
152 | ||
153 | By default when a repeated pattern does not match then the engine will backtrack until | |
154 | a match is found. However, this behaviour can sometimes be undesirable so there are | |
155 | also "possessive" repeats: these match as much as possible and do not then allow | |
156 | backtracking if the rest of the expression fails to match. | |
157 | ||
158 | [^*+] Matches the previous atom zero or more times, while giving nothing back. | |
159 | ||
160 | [^++] Matches the previous atom one or more times, while giving nothing back. | |
161 | ||
162 | [^?+] Matches the previous atom zero or one times, while giving nothing back. | |
163 | ||
164 | [^{n,}+] Matches the previous atom n or more times, while giving nothing back. | |
165 | ||
166 | [^{n,m}+] Matches the previous atom between n and m times, while giving nothing back. | |
167 | ||
168 | [h4 Back references] | |
169 | ||
170 | An escape character followed by a digit /n/, where /n/ is in the range 1-9, | |
171 | matches the same string that was matched by sub-expression /n/. For example | |
172 | the expression: | |
173 | ||
174 | [pre ^(a\*)b+\\1$] | |
175 | ||
176 | Will match the string: | |
177 | ||
178 | aaabbaaa | |
179 | ||
180 | But not the string: | |
181 | ||
182 | aaabba | |
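
A minimal sketch of a back-reference in use, with the pattern `^(a*)b+\1$` chosen for this example: the run of 'a' characters captured at the start must reappear unchanged at the end. The sketch assumes the standard library's `std::regex` (ECMAScript grammar) rather than Boost.Regex, since both accept the `\1` form. The names `same_run_at_both_ends` and `check_back_reference` are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if `text` is a run of 'a', then one or more 'b',
// then exactly the same run of 'a' again.
bool same_run_at_both_ends(const std::string& text) {
    // "\\1" in the C++ literal is the regex back-reference \1.
    static const std::regex e("^(a*)b+\\1$");
    return std::regex_match(text, e);
}

void check_back_reference() {
    assert(same_run_at_both_ends("aaabbaaa"));  // "aaa" ... "aaa"
    assert(!same_run_at_both_ends("aaabba"));   // "aaa" ... "a"
    assert(same_run_at_both_ends("bb"));        // empty run at both ends
}
```

Note that the back-reference matches the text captured by the sub-expression, not the sub-expression's pattern, so a leading run of three characters requires a trailing run of exactly three.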
183 | ||
184 | You can also use the \g escape for the same function, for example: | |
185 | ||
186 | [table | |
187 | [[Escape][Meaning]] | |
188 | [[[^\g1]][Match whatever matched sub-expression 1]] | |
189 | [[[^\g{1}]][Match whatever matched sub-expression 1: this form allows for safer | |
190 | parsing of the expression in cases like [^\g{1}2] or for indexes higher than 9 as in [^\g{1234}]]] | |
191 | [[[^\g-1]][Match whatever matched the last opened sub-expression]] | |
192 | [[[^\g{-2}]][Match whatever matched the last but one opened sub-expression]] | |
193 | [[[^\g{one}]][Match whatever matched the sub-expression named "one"]] | |
194 | ] | |
195 | ||
196 | Finally the \k escape can be used to refer to named subexpressions, for example [^\k<two>] will match | |
197 | whatever matched the subexpression named "two". | |
198 | ||
199 | [h4 Alternation] | |
200 | ||
201 | The [^|] operator will match either of its arguments, so for example: | |
202 | [^abc|def] will match either "abc" or "def". | |
203 | ||
204 | Parentheses can be used to group alternations, for example: [^ab(d|ef)] | |
205 | will match either of "abd" or "abef". | |
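
The two alternation examples above can be sketched as follows. This assumes the standard library's `std::regex` (ECMAScript grammar) in place of Boost.Regex, which is enough for plain alternation. The names `matches_alternation` and `check_alternation` are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if the whole of `text` is matched by `pattern`.
bool matches_alternation(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

void check_alternation() {
    // Without grouping, '|' separates the two complete alternatives.
    assert(matches_alternation("abc|def", "abc"));
    assert(matches_alternation("abc|def", "def"));
    assert(!matches_alternation("abc|def", "abcdef"));

    // With grouping, only the parenthesised part is an alternative.
    assert(matches_alternation("ab(d|ef)", "abd"));
    assert(matches_alternation("ab(d|ef)", "abef"));
    assert(!matches_alternation("ab(d|ef)", "ab"));
    assert(!matches_alternation("ab(d|ef)", "abdef"));
}
```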
206 | ||
207 | Empty alternatives are not allowed (these are almost always a mistake), but | |
208 | if you really want an empty alternative use [^(?:)] as a placeholder, for example: | |
209 | ||
210 | [^|abc] is not a valid expression, but | |
211 | ||
212 | [^(?:)|abc] is valid and is equivalent. The expression: | |
213 | ||
214 | [^(?:abc)??] has exactly the same effect. | |
215 | ||
216 | [h4 Character sets] | |
217 | ||
218 | A character set is a bracket-expression starting with [^[] and ending with [^]], | |
219 | it defines a set of characters, and matches any single character that is a | |
220 | member of that set. | |
221 | ||
222 | A bracket expression may contain any combination of the following: | |
223 | ||
224 | [h5 Single characters] | |
225 | ||
226 | For example [^\[abc\]] will match any of the characters 'a', 'b', or 'c'. | |
227 | ||
228 | [h5 Character ranges] | |
229 | ||
230 | For example [^\[a-c\]] will match any single character in the range 'a' to 'c'. | |
231 | By default, for Perl regular expressions, a character x is within the | |
232 | range y to z, if the code point of the character lies within the code points of | |
233 | the endpoints of the range. Alternatively, if you set the | |
234 | [link boost_regex.ref.syntax_option_type.syntax_option_type_perl [^collate] flag] | |
235 | when constructing the regular expression, then ranges are locale sensitive. | |
236 | ||
237 | [h5 Negation] | |
238 | ||
239 | If the bracket-expression begins with the ^ character, then it matches the | |
240 | complement of the characters it contains, for example [^\[^a-c\]] matches | |
241 | any character that is not in the range [^a-c]. | |
242 | ||
243 | [h5 Character classes] | |
244 | ||
245 | An expression of the form [^\[\[:name:\]\]] matches the named character class | |
246 | "name", for example [^\[\[:lower:\]\]] matches any lower case character. | |
247 | See [link boost_regex.syntax.character_classes character class names]. | |
248 | ||
249 | [h5 Collating Elements] | |
250 | ||
251 | An expression of the form [^\[\[.col.\]\]] matches the collating element /col/. | |
252 | A collating element is any single character, or any sequence of characters | |
253 | that collates as a single unit. Collating elements may also be used | |
254 | as the end point of a range, for example: [^\[\[.ae.\]-c\]] matches the | |
255 | character sequence "ae", plus any single character in the range "ae"-c, | |
256 | assuming that "ae" is treated as a single collating element in the current locale. | |
257 | ||
258 | As an extension, a collating element may also be specified via its | |
259 | [link boost_regex.syntax.collating_names symbolic name], for example: | |
260 | ||
261 | [[.NUL.]] | |
262 | ||
263 | matches a [^\0] character. | |
264 | ||
265 | [h5 Equivalence classes] | |
266 | ||
267 | An expression of the form [^\[\[\=col\=\]\]] matches any character or collating element | |
268 | whose primary sort key is the same as that for collating element /col/. As with | |
269 | collating elements, the name /col/ may be a | |
270 | [link boost_regex.syntax.collating_names symbolic name]. A primary sort key is | |
271 | one that ignores case, accentuation, or locale-specific tailorings; so for | |
272 | example `[[=a=]]` matches any of the characters: | |
273 | a, '''À''', '''Á''', '''Â''', | |
274 | '''Ã''', '''Ä''', '''Å''', A, '''à''', '''á''', | |
275 | '''â''', '''ã''', '''ä''' and '''å'''. | |
276 | Unfortunately implementation of this is reliant on the platform's collation | |
277 | and localisation support; this feature can not be relied upon to work portably | |
278 | across all platforms, or even all locales on one platform. | |
279 | ||
280 | [h5 Escaped Characters] | |
281 | ||
282 | All the escape sequences that match a single character, or a single character | |
283 | class are permitted within a character class definition. For example | |
284 | `[\[\]]` would match either of `[` or `]` while `[\W\d]` would match any character | |
285 | that is either a "digit", /or/ is /not/ a "word" character. | |
286 | ||
287 | [h5 Combinations] | |
288 | ||
289 | All of the above can be combined in one character set declaration, for example: | |
290 | [^\[\[:digit:\]a-c\[.NUL.\]\]]. | |
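
A few of the bracket-expression forms described above can be exercised one character at a time. This is a minimal sketch that assumes the standard library's `std::regex` (ECMAScript grammar) rather than Boost.Regex. It covers only single characters, ranges, negation and named classes, and leaves out collating elements and equivalence classes because their behaviour depends on the locale. The names `is_in_set` and `check_character_sets` are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if the single character `c` is matched by the bracket expression `set`.
bool is_in_set(const std::string& set, char c) {
    return std::regex_match(std::string(1, c), std::regex(set));
}

void check_character_sets() {
    // Single characters.
    assert(is_in_set("[abc]", 'b'));
    assert(!is_in_set("[abc]", 'd'));

    // Ranges, compared by code point by default.
    assert(is_in_set("[a-c]", 'c'));
    assert(!is_in_set("[a-c]", 'd'));

    // Negation.
    assert(is_in_set("[^a-c]", 'd'));
    assert(!is_in_set("[^a-c]", 'a'));

    // Named character classes, alone and combined with a range.
    assert(is_in_set("[[:lower:]]", 'q'));
    assert(is_in_set("[[:digit:]a-c]", '7'));
    assert(is_in_set("[[:digit:]a-c]", 'b'));
    assert(!is_in_set("[[:digit:]a-c]", 'z'));
}
```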
291 | ||
292 | [h4 Escapes] | |
293 | ||
294 | Any special character preceded by an escape shall match itself. | |
295 | ||
296 | The following escape sequences are all synonyms for single characters: | |
297 | ||
298 | [table | |
299 | [[Escape][Character]] | |
300 | [[[^\a]][[^\a]]] | |
301 | [[[^\e]][[^0x1B]]] | |
302 | [[[^\f]][[^\f]]] | |
303 | [[[^\n]][[^\n]]] | |
304 | [[[^\r]][[^\r]]] | |
305 | [[[^\t]][[^\t]]] | |
306 | [[[^\v]][[^\v]]] | |
307 | [[[^\b]][[^\b] (but only inside a character class declaration).]] | |
308 | [[[^\cX]][An ASCII escape sequence - the character whose code point is X % 32]] | |
309 | [[[^\xdd]][A hexadecimal escape sequence - matches the single character whose | |
310 | code point is 0xdd.]] | |
311 | [[[^\x{dddd}]][A hexadecimal escape sequence - matches the single character whose | |
312 | code point is 0xdddd.]] | |
313 | [[[^\0ddd]][An octal escape sequence - matches the single character whose | |
314 | code point is 0ddd.]] | |
315 | [[[^\N{name}]][Matches the single character which has the | |
316 | [link boost_regex.syntax.collating_names symbolic name] /name/. | |
317 | For example [^\N{newline}] matches the single character \\n.]] | |
318 | ] | |
319 | ||
320 | [h5 "Single character" character classes:] | |
321 | ||
322 | Any escaped character /x/, if /x/ is the name of a character class shall | |
323 | match any character that is a member of that class, and any | |
324 | escaped character /X/, if /x/ is the name of a character class, shall | |
325 | match any character not in that class. | |
326 | ||
327 | The following are supported by default: | |
328 | ||
329 | [table | |
330 | [[Escape sequence][Equivalent to]] | |
331 | [[`\d`][`[[:digit:]]`]] | |
332 | [[`\l`][`[[:lower:]]`]] | |
333 | [[`\s`][`[[:space:]]`]] | |
334 | [[`\u`][`[[:upper:]]`]] | |
335 | [[`\w`][`[[:word:]]`]] | |
336 | [[`\h`][Horizontal whitespace]] | |
337 | [[`\v`][Vertical whitespace]] | |
338 | [[`\D`][`[^[:digit:]]`]] | |
339 | [[`\L`][`[^[:lower:]]`]] | |
340 | [[`\S`][`[^[:space:]]`]] | |
341 | [[`\U`][`[^[:upper:]]`]] | |
342 | [[`\W`][`[^[:word:]]`]] | |
343 | [[`\H`][Not Horizontal whitespace]] | |
344 | [[`\V`][Not Vertical whitespace]] | |
345 | ] | |
346 | ||
347 | [h5 Character Properties] | |
348 | ||
349 | The character property names in the following table are all equivalent | |
350 | to the [link boost_regex.syntax.character_classes names used in character classes]. | |
351 | ||
352 | [table | |
353 | [[Form][Description][Equivalent character set form]] | |
354 | [[`\pX`][Matches any character that has the property X.][`[[:X:]]`]] | |
355 | [[`\p{Name}`][Matches any character that has the property Name.][`[[:Name:]]`]] | |
356 | [[`\PX`][Matches any character that does not have the property X.][`[^[:X:]]`]] | |
357 | [[`\P{Name}`][Matches any character that does not have the property Name.][`[^[:Name:]]`]] | |
358 | ] | |
359 | ||
360 | For example [^\pd] matches any "digit" character, as does [^\p{digit}]. | |
361 | ||
362 | [h5 Word Boundaries] | |
363 | ||
364 | The following escape sequences match the boundaries of words: | |
365 | ||
366 | [^\<] Matches the start of a word. | |
367 | ||
368 | [^\>] Matches the end of a word. | |
369 | ||
370 | [^\b] Matches a word boundary (the start or end of a word). | |
371 | ||
372 | [^\B] Matches only when not at a word boundary. | |
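
A minimal sketch of the `\b` and `\B` assertions, assuming the standard library's `std::regex` (ECMAScript grammar) in place of Boost.Regex. The `\<` and `\>` forms are not part of that grammar, so they are not shown here. The names `contains` and `check_word_boundaries` are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if `pattern` matches somewhere within `text`.
bool contains(const std::string& pattern, const std::string& text) {
    return std::regex_search(text, std::regex(pattern));
}

void check_word_boundaries() {
    // \b requires a word boundary on each side, so only the whole word matches.
    assert(contains("\\bcat\\b", "a cat sat"));
    assert(!contains("\\bcat\\b", "concatenate"));

    // \B requires the absence of a boundary, so only an embedded "cat" matches.
    assert(contains("\\Bcat\\B", "concatenate"));
    assert(!contains("\\Bcat\\B", "a cat sat"));
}
```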
373 | ||
374 | [h5 Buffer boundaries] | |
375 | ||
376 | The following match only at buffer boundaries: a "buffer" in this | |
377 | context is the whole of the input text that is being matched against | |
378 | (note that ^ and $ may match embedded newlines within the text). | |
379 | ||
380 | \\\` Matches at the start of a buffer only. | |
381 | ||
382 | \\' Matches at the end of a buffer only. | |
383 | ||
384 | \\A Matches at the start of a buffer only (the same as [^\\\`]). | |
385 | ||
386 | \\z Matches at the end of a buffer only (the same as [^\\']). | |
387 | ||
388 | \\Z Matches a zero-width assertion consisting of an optional sequence of newlines at the end of a buffer: | |
389 | equivalent to the regular expression [^(?=\\v*\\z)]. Note that this is subtly different from Perl which | |
390 | behaves as if matching [^(?=\\n?\\z)]. | |
391 | ||
392 | [h5 Continuation Escape] | |
393 | ||
394 | The sequence [^\G] matches only at the end of the last match found, or at | |
395 | the start of the text being matched if no previous match was found. | |
396 | This escape is useful if you're iterating over the matches contained within a | |
397 | text, and you want each subsequent match to start where the last one ended. | |
398 | ||
399 | [h5 Quoting escape] | |
400 | ||
401 | The escape sequence [^\Q] begins a "quoted sequence": all the subsequent characters | |
402 | are treated as literals, until either the end of the regular expression or \\E | |
403 | is found. For example the expression: [^\Q\*+\Ea+] would match either of: | |
404 | ||
405 | \*+a | |
406 | \*+aaa | |
407 | ||
408 | [h5 Unicode escapes] | |
409 | ||
410 | [^\C] Matches a single code point: in Boost regex this has exactly the | |
411 | same effect as a "." operator. | |
412 | [^\X] Matches a combining character sequence: that is any non-combining | |
413 | character followed by a sequence of zero or more combining characters. | |
414 | ||
415 | [h5 Matching Line Endings] | |
416 | ||
417 | The escape sequence [^\R] matches any line ending character sequence; specifically, it is identical to | |
418 | the expression [^(?>\x0D\x0A?|\[\x0A-\x0C\x85\x{2028}\x{2029}\])]. | |
419 | ||
420 | [h5 Keeping back some text] | |
421 | ||
422 | [^\K] Resets the start location of $0 to the current text position: in other words everything to the | |
423 | left of \K is "kept back" and does not form part of the regular expression match. $` is updated | |
424 | accordingly. | |
425 | ||
426 | For example [^foo\Kbar] matched against the text "foobar" would return the match "bar" for $0 and "foo" | |
427 | for $`. This can be used to simulate variable width lookbehind assertions. | |
428 | ||
429 | [h5 Any other escape] | |
430 | ||
431 | Any other escape sequence matches the character that is escaped, for example | |
432 | \\@ matches a literal '@'. | |
433 | ||
434 | [h4 Perl Extended Patterns] | |
435 | ||
436 | Perl-specific extensions to the regular expression syntax all start with [^(?]. | |
437 | ||
438 | [h5 Named Subexpressions] | |
439 | ||
440 | You can create a named subexpression using: | |
441 | ||
442 | (?<NAME>expression) | |
443 | ||
444 | Which can then be referred to by the name /NAME/. Alternatively you can delimit the name | |
445 | using 'NAME' as in: | |
446 | ||
447 | (?'NAME'expression) | |
448 | ||
449 | These named subexpressions can be referred to in a backreference using either [^\g{NAME}] or [^\k<NAME>] | |
450 | and can also be referred to by name in a [perl_format] format string for search and replace operations, or in the | |
451 | [match_results] member functions. | |
452 | ||
453 | [h5 Comments] | |
454 | ||
455 | [^(?# ... )] is treated as a comment; its contents are ignored. | |
456 | ||
457 | [h5 Modifiers] | |
458 | ||
459 | [^(?imsx-imsx ... )] alters which of the Perl modifiers are in effect within | |
460 | the pattern. Changes take effect from the point that the block is first seen | |
461 | and extend to any enclosing [^)]. Letters before a '-' turn that Perl | |
462 | modifier on; letters afterward turn it off. | |
463 | ||
464 | [^(?imsx-imsx:pattern)] applies the specified modifiers to pattern only. | |
465 | ||
466 | [h5 Non-marking groups] | |
467 | ||
468 | [^(?:pattern)] lexically groups pattern, without generating an additional | |
469 | sub-expression. | |
470 | ||
471 | [h5 Branch reset] | |
472 | ||
473 | [^(?|pattern)] resets the subexpression count at the start of each "|" alternative within /pattern/. | |
474 | ||
475 | The sub-expression count following this construct is that of whichever branch had the largest number of | |
476 | sub-expressions. This construct is useful when you want to capture one of a number of alternative matches | |
477 | in a single sub-expression index. | |
478 | ||
479 | In the following example the index of each sub-expression is shown below the expression: | |
480 | ||
481 | [pre | |
482 | # before ---------------branch-reset----------- after | |
483 | / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x | |
484 | # 1 2 2 3 2 3 4 | |
485 | ] | |
486 | ||
487 | [h5 Lookahead] | |
488 | ||
489 | [^(?=pattern)] consumes zero characters, only if pattern matches. | |
490 | ||
491 | [^(?!pattern)] consumes zero characters, only if pattern does not match. | |
492 | ||
493 | Lookahead is typically used to create the logical AND of two regular | |
494 | expressions, for example if a password must contain a lower case letter, | |
495 | an upper case letter, a punctuation symbol, and be at least 6 characters long, | |
496 | then the expression: | |
497 | ||
498 | (?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,} | |
499 | ||
500 | could be used to validate the password. | |
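
The password expression above can be wrapped in a small predicate. This is a minimal sketch that assumes the standard library's `std::regex` (ECMAScript grammar) rather than Boost.Regex, since both support positive lookahead and the named character classes used here. The names `is_valid_password` and `check_password_rule` are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if `candidate` contains a lower case letter, an upper case letter and
// a punctuation symbol, and is at least 6 characters long.
bool is_valid_password(const std::string& candidate) {
    // Each lookahead tests one requirement from the start of the text without
    // consuming any input; the final .{6,} enforces the minimum length.
    static const std::regex e(
        "(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,}");
    return std::regex_match(candidate, e);
}

void check_password_rule() {
    assert(is_valid_password("Secret!x"));
    assert(!is_valid_password("secret!x"));  // no upper case letter
    assert(!is_valid_password("SECRET!X"));  // no lower case letter
    assert(!is_valid_password("Secretxx"));  // no punctuation symbol
    assert(!is_valid_password("Ab!"));       // too short
}
```

Because every lookahead starts at the same position, the order in which the requirements are written does not affect the result.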
501 | ||
502 | [h5 Lookbehind] | |
503 | ||
504 | [^(?<=pattern)] consumes zero characters, only if pattern could be matched | |
505 | against the characters preceding the current position (pattern must be | |
506 | of fixed length). | |
507 | ||
508 | [^(?<!pattern)] consumes zero characters, only if pattern could not be | |
509 | matched against the characters preceding the current position (pattern must | |
510 | be of fixed length). | |
511 | ||
512 | [h5 Independent sub-expressions] | |
513 | ||
514 | [^(?>pattern)] /pattern/ is matched independently of the surrounding patterns: | |
515 | the expression will never backtrack into /pattern/. Independent sub-expressions | |
516 | are typically used to improve performance; only the best possible match | |
517 | for /pattern/ will be considered, and if this doesn't allow the expression as a | |
518 | whole to match then no match is found at all. | |
519 | ||
520 | [h5 Recursive Expressions] | |
521 | ||
522 | [^(?['N]) (?-['N]) (?+['N]) (?R) (?0) (?&NAME)] | |
523 | ||
524 | [^(?R)] and [^(?0)] recurse to the start of the entire pattern. | |
525 | ||
526 | [^(?['N])] executes sub-expression /N/ recursively, for example [^(?2)] will recurse to sub-expression 2. | |
527 | ||
528 | [^(?-['N])] and [^(?+['N])] are relative recursions, so for example [^(?-1)] recurses to the last sub-expression to be declared, | |
529 | and [^(?+1)] recurses to the next sub-expression to be declared. | |
530 | ||
531 | [^(?&NAME)] recurses to named sub-expression ['NAME]. | |
532 | ||
533 | [h5 Conditional Expressions] | |
534 | ||
535 | [^(?(condition)yes-pattern|no-pattern)] attempts to match /yes-pattern/ if | |
536 | the /condition/ is true, otherwise attempts to match /no-pattern/. | |
537 | ||
538 | [^(?(condition)yes-pattern)] attempts to match /yes-pattern/ if the /condition/ | |
539 | is true, otherwise matches the NULL string. | |
540 | ||
541 | /condition/ may be either: a forward lookahead assert, the index of | |
542 | a marked sub-expression (the condition becomes true if the sub-expression | |
543 | has been matched), or an index of a recursion (the condition becomes true if we are executing | |
544 | directly inside the specified recursion). | |
545 | ||
546 | Here is a summary of the possible predicates: | |
547 | ||
548 | * [^(?(?\=assert)yes-pattern|no-pattern)] Executes /yes-pattern/ if the forward look-ahead assert matches, otherwise | |
549 | executes /no-pattern/. | |
550 | * [^(?(?!assert)yes-pattern|no-pattern)] Executes /yes-pattern/ if the forward look-ahead assert does not match, otherwise | |
551 | executes /no-pattern/. | |
552 | * [^(?(['N])yes-pattern|no-pattern)] Executes /yes-pattern/ if subexpression /N/ has been matched, otherwise | |
553 | executes /no-pattern/. | |
554 | * [^(?(<['name]>)yes-pattern|no-pattern)] Executes /yes-pattern/ if named subexpression /name/ has been matched, otherwise | |
555 | executes /no-pattern/. | |
556 | * [^(?('['name]')yes-pattern|no-pattern)] Executes /yes-pattern/ if named subexpression /name/ has been matched, otherwise | |
557 | executes /no-pattern/. | |
558 | * [^(?(R)yes-pattern|no-pattern)] Executes /yes-pattern/ if we are executing inside a recursion, otherwise | |
559 | executes /no-pattern/. | |
560 | * [^(?(R['N])yes-pattern|no-pattern)] Executes /yes-pattern/ if we are executing inside a recursion to sub-expression /N/, otherwise | |
561 | executes /no-pattern/. | |
562 | * [^(?(R&['name])yes-pattern|no-pattern)] Executes /yes-pattern/ if we are executing inside a recursion to named sub-expression /name/, otherwise | |
563 | executes /no-pattern/. | |
564 | * [^(?(DEFINE)never-executed-pattern)] Defines a block of code that is never executed and matches no characters: | |
565 | this is usually used to define one or more named sub-expressions which are referred to from elsewhere in the pattern. | |
566 | ||
567 | [h5 Backtracking Control Verbs] | |
568 | ||
569 | This library has partial support for Perl's backtracking control verbs, in particular (*MARK) is not supported. | |
570 | There may also be detail differences in behaviour between this library and Perl, not least because Perl's behaviour | |
571 | is rather under-documented and often somewhat random in how it behaves in practice. The verbs supported are: | |
572 | ||
573 | * [^(*PRUNE)] Has no effect unless backtracked onto, in which case all the backtracking information prior to this | |
574 | point is discarded. | |
575 | * [^(*SKIP)] Behaves the same as [^(*PRUNE)] except that it is assumed that no match can possibly occur prior to | |
576 | the current point in the string being searched. This can be used to optimize searches by skipping over chunks of text | |
577 | that have already been determined can not form a match. | |
578 | * [^(*THEN)] Has no effect unless backtracked onto, in which case all subsequent alternatives in a group of alternations | |
579 | are discarded. | |
580 | * [^(*COMMIT)] Has no effect unless backtracked onto, in which case all subsequent matching/searching attempts are abandoned. | |
581 | * [^(*FAIL)] Causes the match to fail unconditionally at this point, can be used to force the engine to backtrack. | |
582 | * [^(*ACCEPT)] Causes the pattern to be considered matched at the current point. Any half-open sub-expressions are closed at the current point. | |
583 | ||
584 | [h4 Operator precedence] | |
585 | ||
586 | The order of precedence of operators is as follows: | |
587 | ||
588 | # Collation-related bracket symbols `[==] [::] [..]` | |
589 | # Escaped characters [^\\] | |
590 | # Character set (bracket expression) `[]` | |
591 | # Grouping [^()] | |
592 | # Single-character-ERE duplication [^* + ? {m,n}] | |
593 | # Concatenation | |
594 | # Anchoring ^$ | |
595 | # Alternation | | |
596 | ||
597 | [h3 What gets matched] | |
598 | ||
599 | If you view the regular expression as a directed (possibly cyclic) | |
600 | graph, then the best match found is the first match found by a | |
601 | depth-first-search performed on that graph, while matching the input text. | |
602 | ||
603 | Alternatively: | |
604 | ||
605 | The best match found is the | |
606 | [link boost_regex.syntax.leftmost_longest_rule leftmost match], | |
607 | with individual elements matched as follows: | |
608 | ||
609 | [table | |
610 | [[Construct][What gets matched]] | |
611 | [[[^AtomA AtomB]][Locates the best match for /AtomA/ that has a following match for /AtomB/.]] | |
612 | [[[^Expression1 | Expression2]][If /Expression1/ can be matched then returns that match, | |
613 | otherwise attempts to match /Expression2/.]] | |
614 | [[[^S{N}]][Matches /S/ repeated exactly N times.]] | |
615 | [[[^S{N,M}]][Matches S repeated between N and M times, and as many times as possible.]] | |
616 | [[[^S{N,M}?]][Matches S repeated between N and M times, and as few times as possible.]] | |
617 | [[[^S?, S*, S+]][The same as [^S{0,1}], [^S{0,UINT_MAX}], [^S{1,UINT_MAX}] respectively.]] | |
618 | [[[^S??, S*?, S+?]][The same as [^S{0,1}?], [^S{0,UINT_MAX}?], [^S{1,UINT_MAX}?] respectively.]] | |
619 | [[[^(?>S)]][Matches the best match for /S/, and only that.]] | |
620 | [[[^(?=S), (?<=S)]][Matches only the best match for /S/ (this is only | |
621 | visible if there are capturing parentheses within /S/).]] | |
622 | [[[^(?!S), (?<!S)]][Considers only whether a match for S exists or not.]] | |
623 | [[[^(?(condition)yes-pattern | no-pattern)]][If condition is true, then | |
624 | only yes-pattern is considered, otherwise only no-pattern is considered.]] | |
625 | ] | |
626 | ||
627 | [h3 Variations] | |
628 | ||
629 | The [link boost_regex.ref.syntax_option_type.syntax_option_type_perl options [^normal], | |
630 | [^ECMAScript], [^JavaScript] and [^JScript]] are all synonyms for | |
631 | [^perl]. | |
632 | ||
633 | [h3 Options] | |
634 | ||
635 | There are a [link boost_regex.ref.syntax_option_type.syntax_option_type_perl | |
636 | variety of flags] that may be combined with the [^perl] option when | |
637 | constructing the regular expression, in particular note that the | |
638 | [^newline_alt] option alters the syntax, while the [^collate] and [^icase] options | |
639 | modify how locale and case sensitivity are applied, and [^nosubs] suppresses the storing of sub-expression matches. | |
640 | ||
641 | [h3 Pattern Modifiers] | |
642 | ||
643 | The Perl [^smix] modifiers can either be applied using a [^(?smix-smix)] | |
644 | prefix to the regular expression, or with one of the | |
645 | [link boost_regex.ref.syntax_option_type.syntax_option_type_perl regex-compile time | |
646 | flags [^no_mod_m], [^mod_x], [^mod_s], and [^no_mod_s]]. | |
647 | ||
648 | [h3 References] | |
649 | ||
650 | [@http://perldoc.perl.org/perlre.html Perl 5.8]. | |
651 | ||
652 | ||
653 | [endsect] | |
654 | ||
655 |