[/
 / Copyright (c) 2008 Eric Niebler
 /
 / Distributed under the Boost Software License, Version 1.0. (See accompanying
 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 /]

[section String Splitting and Tokenization]

_regex_token_iterator_ is the Ginsu knife of the text manipulation world. It slices! It dices! This section describes
how to use the highly configurable _regex_token_iterator_ to chop up input sequences.

[h2 Overview]

You initialize a _regex_token_iterator_ with an input sequence, a regex, and some optional configuration parameters.
The _regex_token_iterator_ will use _regex_search_ to find the first place in the sequence that the regex matches. When
dereferenced, the _regex_token_iterator_ returns a ['token] in the form of a `std::basic_string<>`. Which string it returns
depends on the configuration parameters. By default it returns a string corresponding to the full match, but it could also
return a string corresponding to a particular marked sub-expression, or even the part of the sequence that ['didn't] match.
When you increment the _regex_token_iterator_, it will move to the next token. Which token is next depends on the configuration
parameters. It could simply be a different marked sub-expression in the current match, or it could be part or all of the
next match. Or it could be the part that ['didn't] match.

As you can see, _regex_token_iterator_ can do a lot. That makes it hard to describe, but some examples should make it clear.

[h2 Example 1: Simple Tokenization]

This example uses _regex_token_iterator_ to chop a sequence into a series of tokens consisting of words.

    std::string input("This is his face");
    sregex re = +_w;                      // find a word

    // iterate over all the words in the input
    sregex_token_iterator begin( input.begin(), input.end(), re ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
This
is
his
face
]

[h2 Example 2: Simple Tokenization, Reloaded]

This example also uses _regex_token_iterator_ to chop a sequence into a series of tokens consisting of words,
but it uses the regex as a delimiter. When we pass a `-1` as the last parameter to the _regex_token_iterator_
constructor, it instructs the token iterator to consider as tokens those parts of the input that ['didn't]
match the regex.

    std::string input("This is his face");
    sregex re = +_s;                      // find white space

    // iterate over all non-white space in the input. Note the -1 below:
    sregex_token_iterator begin( input.begin(), input.end(), re, -1 ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
This
is
his
face
]

[h2 Example 3: Simple Tokenization, Revolutions]

This example also uses _regex_token_iterator_ to chop a sequence containing a bunch of dates into a series of
tokens consisting of just the years. When we pass a positive integer [^['N]] as the last parameter to the
_regex_token_iterator_ constructor, it instructs the token iterator to consider as tokens only the [^['N]]-th
marked sub-expression of each match.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over all the years in the input. Note the 3 below, corresponding to the 3rd sub-expression:
    sregex_token_iterator begin( input.begin(), input.end(), re, 3 ), end;

    // write all the years to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
2003
1999
1981
]

[h2 Example 4: Not-So-Simple Tokenization]

This example is like the previous one, except that instead of tokenizing just the years, this program
turns the days, months and years into tokens. When we pass an array of integers [^['{I,J,...}]] as the last
parameter to the _regex_token_iterator_ constructor, it instructs the token iterator to consider as tokens the
[^['I]]-th, [^['J]]-th, etc. marked sub-expression of each match.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over the days, months and years in the input
    int const sub_matches[] = { 2, 1, 3 }; // day, month, year
    sregex_token_iterator begin( input.begin(), input.end(), re, sub_matches ), end;

    // write all the tokens to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
02
01
2003
23
04
1999
13
11
1981
]

The `sub_matches` array instructs the _regex_token_iterator_ to first take the value of the 2nd sub-match, then
the 1st sub-match, and finally the 3rd. Incrementing the iterator again instructs it to use _regex_search_ again
to find the next match. At that point, the process repeats -- the token iterator takes the value of the 2nd
sub-match, then the 1st, et cetera.

[endsect]