[/ 
  Copyright 2006-2007 John Maddock.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]

[section Introduction and Overview]

Regular expressions are a form of pattern matching that is often used in 
text processing; many users will be familiar with the Unix utilities grep, sed 
and awk, and the programming language Perl, each of which makes extensive use 
of regular expressions. Traditionally C++ users have been limited to the 
POSIX C APIs for manipulating regular expressions, and while Boost.Regex does 
provide these APIs, they do not represent the best way to use the library. 
For example, Boost.Regex can cope with wide character strings, and with search and 
replace operations (in a manner analogous to either sed or Perl), something 
that traditional C libraries cannot do.

The class [basic_regex] is the key class in this library; it represents a 
"machine readable" regular expression, and is very closely modeled on 
`std::basic_string`: think of it as a string plus the actual state machine 
required by the regular expression algorithms. As with `std::basic_string`, there 
are two typedefs that are almost always the means by which this class is referenced:

   namespace boost{

   template <class charT, 
            class traits = regex_traits<charT> >
   class basic_regex;

   typedef basic_regex<char> regex;
   typedef basic_regex<wchar_t> wregex;

   }
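
As a minimal sketch of how the two typedefs are used side by side, the following overloads check narrow and wide strings with the same expression. It is written against `std::regex` and `std::wregex`, the standard-library descendants of this interface, so that it compiles without Boost; the assumption is that replacing `std::` with `boost::` and `<regex>` with `<boost/regex.hpp>` gives the Boost.Regex equivalent. The name `is_all_digits` is hypothetical.

```cpp
#include <regex>
#include <string>

// Narrow-character version: uses the regex typedef (basic_regex<char>).
bool is_all_digits(const std::string& s)
{
   static const std::regex e("[[:digit:]]+");
   return std::regex_match(s, e);
}

// Wide-character version: uses the wregex typedef (basic_regex<wchar_t>).
// Only the character type of the pattern and the input changes.
bool is_all_digits(const std::wstring& s)
{
   static const std::wregex e(L"[[:digit:]]+");
   return std::regex_match(s, e);
}
```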

To see how this library can be used, imagine that we are writing a credit 
card processing application. Credit card numbers generally come as a string 
of 16 digits, split into groups of 4 digits that are separated by either a 
space or a hyphen. Before storing a credit card number in a database 
(not necessarily something your customers will appreciate!), we may want to 
verify that the number is in the correct format. To match any digit we could 
use the regular expression \[0-9\]; however, ranges of characters like this are 
actually locale dependent. Instead we should use the POSIX standard 
form \[\[:digit:\]\], or the Boost.Regex and Perl shorthand for this, \\d (note 
that many older libraries tended to be hard-coded to the C locale; 
consequently this was not an issue for them). That leaves us with the 
following regular expression to validate credit card number formats:

[pre (\d{4}\[- \]){3}\d{4}]

Here the parentheses act to group (and mark for future reference) 
sub-expressions, and the {4} means "repeat exactly 4 times". This is an 
example of the extended regular expression syntax used by Perl, awk and egrep. 
Boost.Regex also supports the older "basic" syntax used by sed and grep, 
but this is generally less useful unless you already have some basic regular 
expressions that you need to reuse.

Now let's take that expression and place it in some C++ code to validate the 
format of a credit card number:

   bool validate_card_format(const std::string& s)
   {
      static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
      return regex_match(s, e);
   }

Note how we had to add some extra escapes to the expression: remember that 
the escape is seen once by the C++ compiler, before it gets to be seen by 
the regular expression engine; consequently, escapes in regular expressions 
have to be doubled up when embedding them in C/C++ code. Also note that 
all the examples assume that your compiler supports argument-dependent 
lookup; if yours doesn't (for example VC6), then you will have to add some 
`boost::` prefixes to some of the function calls in the examples.
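
Where a C++11 compiler is available, a raw string literal avoids the doubled escapes altogether. The following is only a sketch under that assumption: it uses `std::regex` so that it compiles without Boost (the expectation being that `boost::regex` accepts the same literal), and the function name `validate_card_format_raw` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical variant of validate_card_format: inside R"( ... )" the
// backslashes are not processed by the compiler, so the expression is
// written exactly as the regular expression engine will see it.
bool validate_card_format_raw(const std::string& s)
{
   static const std::regex e(R"((\d{4}[- ]){3}\d{4})");
   return std::regex_match(s, e);
}
```

With this definition, `"1234-5678-9012-3456"` and `"1234 5678 9012 3456"` are accepted, while `"1234567890123456"` is rejected because the expression requires a separator after each of the first three groups.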

Those of you who are familiar with credit card processing will have realized 
that while the format used above is suitable for human-readable card numbers, 
it does not represent the format required by online credit card systems; these 
require the number as a string of 16 (or possibly 15) digits, without any 
intervening spaces. What we need is a means to convert easily between the two 
formats, and this is where search and replace comes in. Those who are familiar 
with the utilities sed and Perl will already be ahead here; we need two 
strings: one a regular expression, the other a "format string" that provides 
a description of the text to replace the match with. In Boost.Regex this 
search and replace operation is performed with the algorithm [regex_replace]. 
For our credit card example we can write two algorithms like this to 
provide the format conversions:

   // match any format with the regular expression:
   const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
   const std::string machine_format("\\1\\2\\3\\4");
   const std::string human_format("\\1-\\2-\\3-\\4");

   std::string machine_readable_card_number(const std::string& s)
   {
      return regex_replace(s, e, machine_format, boost::match_default | boost::format_sed);
   }

   std::string human_readable_card_number(const std::string& s)
   {
      return regex_replace(s, e, human_format, boost::match_default | boost::format_sed);
   }

Here we've used marked sub-expressions in the regular expression to split out 
the four parts of the card number as separate fields; the format string then 
uses the sed-like syntax to replace the matched text with the reformatted version.
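
To make the round trip concrete, here is a self-contained sketch of the same two conversions. It is an approximation rather than the code above: it uses `std::regex` so that it builds without Boost, and because the default `std::regex` grammar has no `\A` and `\z` anchors it substitutes `^` and `$`. The names `card_expression`, `to_machine_format` and `to_human_format` are hypothetical.

```cpp
#include <regex>
#include <string>

// One expression accepts either format: separators are optional, and the
// first group may hold 3 digits to allow for 15-digit numbers.
const std::regex& card_expression()
{
   static const std::regex e(R"(^(\d{3,4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})$)");
   return e;
}

// Strip the separators: the sed-style format string joins the four fields.
std::string to_machine_format(const std::string& s)
{
   return std::regex_replace(s, card_expression(), std::string("\\1\\2\\3\\4"),
      std::regex_constants::match_default | std::regex_constants::format_sed);
}

// Insert hyphens between the four fields.
std::string to_human_format(const std::string& s)
{
   return std::regex_replace(s, card_expression(), std::string("\\1-\\2-\\3-\\4"),
      std::regex_constants::match_default | std::regex_constants::format_sed);
}
```

Here `to_machine_format("1234-5678-9012-3456")` yields `1234567890123456`, and `to_human_format("1234567890123456")` yields `1234-5678-9012-3456`. Input that does not match the expression is returned unchanged, since unmatched text is copied through to the output by default.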

In the examples above, we haven't directly manipulated the results of 
a regular expression match; however, in general the result of a match contains 
a number of sub-expression matches in addition to the overall match. When the 
library needs to report a regular expression match it does so using an instance 
of the class [match_results]; as before, there are typedefs of this class for 
the most common cases:

   namespace boost{

   typedef match_results<const char*>                  cmatch;
   typedef match_results<const wchar_t*>               wcmatch;
   typedef match_results<std::string::const_iterator>  smatch;
   typedef match_results<std::wstring::const_iterator> wsmatch; 

   }

The algorithms [regex_search] and [regex_match] make use of [match_results] 
to report what matched; the difference between these algorithms is that 
[regex_match] will only find matches that consume /all/ of the input text, 
whereas [regex_search] will search for a match anywhere within the text being matched.
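
The difference can be sketched with a pair of small functions that share one expression. As an assumption made for portability, the sketch uses `std::regex` and `std::smatch` instead of the `boost::` types so that it compiles without Boost; the function names are hypothetical.

```cpp
#include <regex>
#include <string>

// A card number in four separated groups, each marked as a sub-expression.
const std::regex& grouped_card()
{
   static const std::regex e(R"((\d{4})[- ](\d{4})[- ](\d{4})[- ](\d{4}))");
   return e;
}

// regex_search: succeeds if a card number occurs anywhere in the text.
// what[0] is the whole match; what[1] to what[4] are the marked groups.
std::string first_card_group(const std::string& text)
{
   std::smatch what;
   if (std::regex_search(text, what, grouped_card()))
      return what[1].str();
   return std::string();
}

// regex_match: succeeds only if the whole of the text is a card number.
bool is_exactly_a_card(const std::string& text)
{
   std::smatch what;
   return std::regex_match(text, what, grouped_card());
}
```

For the input `"card: 1234-5678-9012-3456"`, `first_card_group` returns `1234` while `is_exactly_a_card` returns `false`, because the leading `card: ` is not consumed by the expression.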

Note that these algorithms are not restricted to searching regular C-strings; 
any bidirectional iterator type can be searched, allowing for the 
possibility of seamlessly searching almost any kind of data.
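
As an illustration, the sketch below searches a `std::list<char>`, whose iterators are bidirectional but not random access. It is written with the `std::` equivalents of the library's types so that it compiles without Boost, on the assumption that the `boost::` versions accept the same iterator range; `offset_of_first_number` is a hypothetical name.

```cpp
#include <iterator>
#include <list>
#include <regex>

// Returns the number of characters that precede the first run of digits
// in the list, or -1 if the list contains no digits.
long offset_of_first_number(const std::list<char>& text)
{
   typedef std::list<char>::const_iterator iterator;
   static const std::regex e("[0-9]+");
   // match_results is parameterised on the iterator type being searched.
   std::match_results<iterator> what;
   if (std::regex_search(text.begin(), text.end(), what, e))
      return static_cast<long>(std::distance(text.begin(), what[0].first));
   return -1;
}
```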

For search and replace operations, in addition to the algorithm [regex_replace] 
that we have already seen, the [match_results] class has a `format` member that 
takes the result of a match and a format string, and produces a new string 
by merging the two.
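
A minimal sketch of the `format` member follows, assuming the sed-style format syntax used earlier. It uses `std::smatch` so that it compiles without Boost (the `boost::` class is expected to behave the same way here), and the helper name `masked_card` is hypothetical.

```cpp
#include <regex>
#include <string>

// Finds a card number inside `text` and returns it with all but the last
// group masked; returns an empty string if no card number is found.
std::string masked_card(const std::string& text)
{
   static const std::regex e(R"((\d{4})[- ](\d{4})[- ](\d{4})[- ](\d{4}))");
   std::smatch what;
   if (!std::regex_search(text, what, e))
      return std::string();
   // With format_sed, \4 stands for the fourth marked sub-expression;
   // the asterisks and hyphens are copied literally.
   return what.format("****-****-****-\\4", std::regex_constants::format_sed);
}
```

Unlike `regex_replace`, `format` produces only the formatted match and not the surrounding text, so `masked_card("pay with 1234-5678-9012-3456 today")` returns `****-****-****-3456`.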

For iterating through all occurrences of an expression within a text, 
there are two iterator types: [regex_iterator] will enumerate over the 
[match_results] objects found, while [regex_token_iterator] will enumerate 
a series of strings (similar to Perl-style split operations).
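
The two iterators can be sketched as follows. The example uses the `std::sregex_iterator` and `std::sregex_token_iterator` typedefs so that it compiles without Boost, on the assumption that the `boost::` iterators are used the same way; both function names are hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// regex_iterator: each dereference yields a match_results object, here
// used to collect every run of digits in the text.
std::vector<std::string> all_numbers(const std::string& text)
{
   static const std::regex e("[0-9]+");
   std::vector<std::string> result;
   for (std::sregex_iterator i(text.begin(), text.end(), e), end; i != end; ++i)
      result.push_back(i->str());
   return result;
}

// regex_token_iterator: the sub-match index -1 selects the text between
// matches rather than the matches themselves, giving a split operation.
std::vector<std::string> split_on_commas(const std::string& text)
{
   static const std::regex e(",");
   std::vector<std::string> result;
   for (std::sregex_token_iterator i(text.begin(), text.end(), e, -1), end; i != end; ++i)
      result.push_back(i->str());
   return result;
}
```

A default-constructed iterator of either type acts as the end-of-sequence marker, which is why `end` is declared without arguments in both loops.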

For those who dislike templates, there is a high-level wrapper class 
[RegEx] that encapsulates the lower-level template code. It 
provides a simplified interface for those who don't need the full 
power of the library, and supports only narrow characters and the 
"extended" regular expression syntax. This class is now deprecated as 
it does not form part of the regular expressions C++ standard library proposal.

The POSIX API functions [regcomp], [regexec], [regfree] and [regerror] 
are available in both narrow character and Unicode versions, and are 
provided for those who need compatibility with these APIs.

Finally, note that the library now has 
[link boost_regex.background_information.locale run-time localization support], 
and recognizes the full POSIX regular expression syntax - including 
advanced features like multi-character collating elements and equivalence 
classes - as well as providing compatibility with other regular expression 
libraries including GNU and BSD4 regex packages, PCRE and Perl 5. 

[endsect]