[ceph.git] / ceph / src / boost / libs / locale / doc / boundary_analysys.txt

// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen

//
//  Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt)
//

/*!
\page boundary_analysys Boundary analysis

- \ref boundary_analysys_basics
- \ref boundary_analysys_segments
    - \ref boundary_analysys_segments_basics
    - \ref boundary_analysys_segments_rules
    - \ref boundary_analysys_segments_search
- \ref boundary_analysys_break
    - \ref boundary_analysys_break_basics
    - \ref boundary_analysys_break_rules
    - \ref boundary_analysys_break_search


\section boundary_analysys_basics Basics

Boost.Locale provides a boundary analysis tool, allowing you to split text into characters,
words, or sentences, and find appropriate places for line breaks. 

\note This task is not a trivial task.
\par
A Unicode code point and a character are not equivalent, for example: 
Hebrew word Shalom - "שָלוֹם" that consists of 4 characters and 6 code points (4 base letters and 2 diacritical marks)
\par
Words may not be separated by space characters in some languages like in Japanese or Chinese.

Boost.Locale provides 2 major classes for boundary analysis:

-   \ref boost::locale::boundary::segment_index - an object that holds an index of segments in the text (like words, characters,
    sentences). It provides an access to \ref boost::locale::boundary::segment "segment" objects via iterators.
-   \ref boost::locale::boundary::boundary_point_index - an object that holds an index of boundary points in the text.
    It allows to iterate over the \ref boost::locale::boundary::boundary_point "boundary_point" objects.

Each of the classes above use an iterator type as template parameter. 
Both of these classes accept in their constructor:

- A flag that defines boundary analysis \ref boost::locale::boundary::boundary_type "boundary_type".
- The pair of iterators that define the text range that should be analysed
- A locale parameter (if not given the global one is used)

For example:
\code
namespace ba=boost::locale::boundary;
std::string text= ... ;
std::locale loc = ... ;
ba::segment_index<std::string::const_iterator> map(ba::word,text.begin(),text.end(),loc);
\endcode

Each of them provide a members \c begin(), \c end() and \c find() that allow to iterate
over the selected segments or boundaries in the text or find a location of a segment or
boundary for given iterator.


Convenience a typedefs like \ref boost::locale::boundary::ssegment_index "ssegment_index"
or \ref boost::locale::boundary::wcboundary_point_index "wcboundary_point_index" provided as well,
where "w", "u16" and "u32" prefixes define a character type \c wchar_t,
\c char16_t and \c char32_t and "c" and "s" prefixes define whether <tt>std::basic_string<CharType>::const_iterator</tt>
or <tt>CharType const *</tt> are used.

\section boundary_analysys_segments Iterating Over Segments
\section boundary_analysys_segments_basics Basic Iteration

The text segments analysis is done using \ref boost::locale::boundary::segment_index "segment_index" class.

It provides a bidirectional iterator that returns \ref boost::locale::boundary::segment "segment" object.
The segment object represents a pair of iterators that define this segment and a rule according to which it was selected.
It can be automatically converted to \c std::basic_string object.

To perform boundary analysis, we first create an index object and then iterate over it:

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
std::string text="To be or not to be, that is the question."
// Create mapping of text for token iterator using global locale.
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8")); 
// Print all "words" -- chunks of word boundary
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",
\endverbatim

This sentence "生きるか死ぬか、それが問題だ。" (<a href="http://tatoeba.org/eng/sentences/show/868189">from Tatoeba database</a>)
would be split into following segments in \c ja_JP.UTF-8 (Japanese) locale:

\verbatim
"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。", 
\endverbatim

The boundary analysis that is done by Boost.Locale
is much more complicated then just splitting the text according
to white space characters, even thou it is not perfect. 


\section boundary_analysys_segments_rules Using Rules

The segments selection can be customized using \ref boost::locale::boundary::segment_index::rule(rule_type) "rule()" and
\ref boost::locale::boundary::segment_index::full_select(bool) "full_select()" member functions.

By default segment_index's iterator return each text segment defined by two boundary points regardless
the way they were selected. Thus in the example above we could see text segments like "." or " " 
that were selected as words.

Using a \c rule() member function we can specify a binary mask of rules we want to use for selection of
the boundary points using \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" boundary rules.

For example, by calling

\code
map.rule(word_any);
\endcode

Before starting the iteration process, specify a selection mask that fetches: numbers, letter, Kana letters and
ideographic characters ignoring all non-word related characters like white space or punctuation marks.

So the code:

\code
using namespace boost::locale::boundary;
std::string text="To be or not to be, that is the question."
// Create mapping of text for token iterator using global locale.
ssegment_index map(word,text.begin(),text.end()); 
// Define a rule
map.rule(word_any);
// Print all "words" -- chunks of word boundary
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", "be", "or", "not", "to", "be", "that", "is", "the", "question",
\endverbatim

And the for given text="生きるか死ぬか、それが問題だ。" and rule(\ref boost::locale::boundary::word_ideo "word_ideo"), the example above would print.

\verbatim
"生", "死", "問題",
\endverbatim

You can access specific rules the segments where selected it using \ref boost::locale::boundary::segment::rule() "segment::rule()" member
function. Using a bit-mask of rules.

For example:

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text="生きるか死ぬか、それが問題だ。";
ssegment_index map(word,text.begin(),text.end(),gen("ja_JP.UTF-8")); 
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) {
    std::cout << "Segment " << *it << " contains: ";
    if(it->rule() & word_none)
        std::cout << "white space or punctuation marks ";
    if(it->rule() & word_kana)
        std::cout << "kana characters ";
    if(it->rule() & word_ideo)
        std::cout << "ideographic characters";
    std::cout<< std::endl;
}
\endcode

Would print

\verbatim
Segment 生 contains: ideographic characters
Segment きるか contains: kana characters 
Segment 死 contains: ideographic characters
Segment ぬか contains: kana characters 
Segment 、 contains: white space or punctuation marks 
Segment それが contains: kana characters 
Segment 問題 contains: ideographic characters
Segment だ contains: kana characters 
Segment 。 contains: white space or punctuation marks 
\endverbatim

One important things that should be noted that each segment is defined
by a pair of boundaries and the rule of its ending point defines
if it is selected or not.

In some cases it may be not what we actually look like.

For example we have a text:

\verbatim
Hello! How
are you?
\endverbatim

And we want to fetch all sentences from the text.

The \ref bl_boundary_sentence_rules "sentence rules" have two options:

- Split the text on the point where sentence terminator like ".!?" detected: \ref boost::locale::boundary::sentence_term "sentence_term"
- Split the text on the point where sentence separator like "line feed" detected: \ref boost::locale::boundary::sentence_sep "sentence_sep"

Naturally to ignore sentence separators we would call \ref boost::locale::boundary::segment_index::rule(rule_type v) "segment_index::rule(rule_type v)"
with sentence_term parameter and then run the iterator.

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text=   "Hello! How\n"
                    "are you?\n";
ssegment_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8")); 
map.rule(sentence_term);
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) 
    std::cout << "Sentence [" << *it << "]" << std::endl;
\endcode

However we would get the expected segments:
\verbatim
Sentence [Hello! ]
Sentence [are you?
]
\endverbatim

The reason is that "How\n" is still considered a sentence but selected by different
rule.

This behavior can be changed by setting \ref boost::locale::boundary::segment_index::full_select(bool) "segment_index::full_select(bool)"
to \c true. It would force iterator to join the current segment with all previous segments that may not fit the required rule.

So we add this line:

\code
map.full_select(true);
\endcode

Right after "map.rule(sentence_term);" and get expected output:

\verbatim
Sentence [Hello! ]
Sentence [How
are you?
]
\endverbatim

\subsection boundary_analysys_segments_search Locating Segments

Sometimes it is useful to find a segment that some specific iterator is pointing on.

For example a user had clicked at specific point, we want to select a word on this
location.

\ref boost::locale::boundary::segment_index "segment_index" provides 
\ref boost::locale::boundary::segment_index::find() "find(base_iterator p)" 
member function for this purpose.

This function returns the iterator to the segmet such that \a p points to.


For example:

\code
text="to be or ";
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
ssegment_index::iterator  p = map.find(text.begin() + 4);
if(p!=map.end())
    std::cout << *p << std::endl;
\endcode

Would print:

\verbatim
be
\endverbatim

\note 

if the iterator lays inside the segment this segment returned. If the segment does
not fit the selection rules, then the segment following requested position
is returned.

For example: For \ref boost::locale::boundary::word "word" boundary analysis with \ref boost::locale::boundary::word_any "word_any" rule:

- "t|o be or ", would point to "to" - the iterator in the middle of segment "to".
- "to |be or ", would point to "be" - the iterator at the beginning of the segment "be"
- "to| be or ", would point to "be" - the iterator does is not point to segment with required rule so next valid segment is selected "be".
- "to be or| ", would point to end as not valid segment found.


\section boundary_analysys_break Iterating Over Boundary Points
\section boundary_analysys_break_basics Basic Iteration

The \ref boost::locale::boundary::boundary_point_index "boundary_point_index" is similar  to 
\ref boost::locale::boundary::segment_index "segment_index" in its interface but as a different role.
Instead of returning text chunks (\ref boost::locale::boundary::segment "segment"s, it returns
\ref boost::locale::boundary::boundary_point "boundary_point" object that
represents a position in text - a base iterator used that is used for 
iteration of the source text C++ characters. 
The \ref boost::locale::boundary::boundary_point "boundary_point" object
also provides a \ref boost::locale::boundary::boundary_point::rule() "rule()" member
function that defines a rule this boundary was selected according to.

\note The beginning and the ending of the text are considered boundary points, so even
an empty text consists of at least one boundary point.

Lets see an example of selecting first two sentences from a text:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;

// our text sample
std::string const text="First sentence. Second sentence! Third one?";
// Create an index 
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

// Count two boundary points
sboundary_point_index::iterator p = map.begin(),e=map.end();
int count = 0;
while(p!=e && count < 2) {
    ++count;
    ++p;
}

if(p!=e) {
    std::cout   << "First two sentences are: " 
                << std::string(text.begin(),p->iterator()) 
                << std::endl;
}
else {
    std::cout   <<"There are less then two sentences in this "
                <<"text: " << text << std::endl;
}\endcode

Would print:

\verbatim
First two sentences are: First sentence. Second sentence!
\endverbatim

\section boundary_analysys_break_rules Using Rules

Similarly to the \ref boost::locale::boundary::segment_index "segment_index" the 
\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
a \ref boost::locale::boundary::boundary_point_index::rule(rule_type r) "rule(rule_type mask)"
member function to filter boundary points that interest us.

It allows to set \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" rules for filtering boundary points.

Lets change an example above a little:

\code
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
\endcode

If we run our program as is on the sample above we would get:
\verbatim
First two sentences are: First sentence. Second
\endverbatim

Which is not something that we really expected. As the "Second\n"
is considered an independent sentence that was separated by
a line separator "Line Feed".

However, we can set set a rule \ref boost::locale::boundary::sentence_term "sentence_term" 
and the iterator would use only boundary points that are created
by a sentence terminators like ".!?".

So by adding:
\code
map.rule(sentence_term);
\endcode

Right after the generation of the index we would get the desired output:

\verbatim
First two sentences are: First sentence. Second
sentence! 
\endverbatim

You can also use \ref boost::locale::boundary::boundary_point::rule() "boundary_point::rule()" member
function to learn about the reason this boundary point was created by comparing it with an appropriate
mask.

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

for(sboundary_point_index::iterator p = map.begin(),e=map.end();p!=e;++p) {
    if(p->rule() & sentence_term)
        std::cout << "There is a sentence terminator: ";
    else if(p->rule() & sentence_sep)
        std::cout << "There is a sentence separator: ";
    if(p->rule()!=0) // print if some rule exists
        std::cout   << "[" << std::string(text.begin(),p->iterator()) 
                    << "|" << std::string(p->iterator(),text.end()) 
                    << "]\n";
}
\endcode

Would give the following output:
\verbatim
There is a sentence terminator: [First sentence. |Second
sentence! Third one?]
There is a sentence separator: [First sentence. Second
|sentence! Third one?]
There is a sentence terminator: [First sentence. Second
sentence! |Third one?]
There is a sentence terminator: [First sentence. Second
sentence! Third one?|]
\endverbatim

\subsection boundary_analysys_break_search Locating Boundary Points

Sometimes it is useful to find a specific boundary point according to given
iterator.

\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
a \ref boost::locale::boundary::boundary_point_index::find() "iterator find(base_iterator p)" member
function.

It would return an iterator to a boundary point on \a p's location or at the
location following it if \a p does not point to appropriate position.

For example, for word boundary analysis:

- If a base iterator points to "to |be", then the returned boundary point would be "to |be" (same position)
- If a base iterator points to "t|o be", then the returned boundary point would be "to| be" (next valid position)

For example if we want to select 6 words around specific boundary point we can use following code:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "To be or not to be, that is the question.";

// Create a mapping
sboundary_point_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Ignore wite space
map.rule(word_any);

// define our arbitraty point
std::string::const_iterator pos = text.begin() + 12; // "no|t";

// Get the search range
sboundary_point_index::iterator 
    begin =map.begin(),
    end = map.end(),
    it = map.find(pos); // find a boundary

// go 3 words backward
for(int count = 0;count <3 && it!=begin; count ++) 
    --it;

// Save the start
std::string::const_iterator start = *it;

// go 6 words forward
for(int count = 0;count < 6 && it!=end; count ++)
    ++it;

// make sure we at valid position
if(it==end)
    --it;

// print the text
std::cout << std::string(start,it->iterator()) << std::endl;
\endcode

That would print:

\verbatim
 be or not to be, that
\endverbatim


*/
Commit	Line	Data
7c673cae FG	1	// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen
	2
	3	//
	4	// Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
	5	//
	6	// Distributed under the Boost Software License, Version 1.0. (See
	7	// accompanying file LICENSE_1_0.txt or copy at
	8	// http://www.boost.org/LICENSE_1_0.txt)
	9	//
	10
	11	/*!
	12	\page boundary_analysys Boundary analysis
	13
	14	- \ref boundary_analysys_basics
	15	- \ref boundary_analysys_segments
	16	- \ref boundary_analysys_segments_basics
	17	- \ref boundary_analysys_segments_rules
	18	- \ref boundary_analysys_segments_search
	19	- \ref boundary_analysys_break
	20	- \ref boundary_analysys_break_basics
	21	- \ref boundary_analysys_break_rules
	22	- \ref boundary_analysys_break_search
	23
	24
	25	\section boundary_analysys_basics Basics
	26
	27	Boost.Locale provides a boundary analysis tool, allowing you to split text into characters,
	28	words, or sentences, and find appropriate places for line breaks.
	29
	30	\note This task is not a trivial task.
	31	\par
	32	A Unicode code point and a character are not equivalent, for example:
	33	Hebrew word Shalom - "שָלוֹם" that consists of 4 characters and 6 code points (4 base letters and 2 diacritical marks)
	34	\par
	35	Words may not be separated by space characters in some languages like in Japanese or Chinese.
	36
	37	Boost.Locale provides 2 major classes for boundary analysis:
	38
	39	- \ref boost::locale::boundary::segment_index - an object that holds an index of segments in the text (like words, characters,
	40	sentences). It provides an access to \ref boost::locale::boundary::segment "segment" objects via iterators.
	41	- \ref boost::locale::boundary::boundary_point_index - an object that holds an index of boundary points in the text.
	42	It allows to iterate over the \ref boost::locale::boundary::boundary_point "boundary_point" objects.
	43
	44	Each of the classes above use an iterator type as template parameter.
	45	Both of these classes accept in their constructor:
	46
	47	- A flag that defines boundary analysis \ref boost::locale::boundary::boundary_type "boundary_type".
	48	- The pair of iterators that define the text range that should be analysed
	49	- A locale parameter (if not given the global one is used)
	50
	51	For example:
	52	\code
	53	namespace ba=boost::locale::boundary;
	54	std::string text= ... ;
	55	std::locale loc = ... ;
	56	ba::segment_index<std::string::const_iterator> map(ba::word,text.begin(),text.end(),loc);
	57	\endcode
	58
	59	Each of them provide a members \c begin(), \c end() and \c find() that allow to iterate
	60	over the selected segments or boundaries in the text or find a location of a segment or
	61	boundary for given iterator.
	62
	63
	64	Convenience a typedefs like \ref boost::locale::boundary::ssegment_index "ssegment_index"
65	or \ref boost::locale::boundary::wcboundary_point_index "wcboundary_point_index" provided as well,
66	where "w", "u16" and "u32" prefixes define a character type \c wchar_t,
67	\c char16_t and \c char32_t and "c" and "s" prefixes define whether <tt>std::basic_string<CharType>::const_iterator</tt>
68	or <tt>CharType const *</tt> are used.
69
70	\section boundary_analysys_segments Iterating Over Segments
71	\section boundary_analysys_segments_basics Basic Iteration
72
73	The text segments analysis is done using \ref boost::locale::boundary::segment_index "segment_index" class.
74
75	It provides a bidirectional iterator that returns \ref boost::locale::boundary::segment "segment" object.
76	The segment object represents a pair of iterators that define this segment and a rule according to which it was selected.
77	It can be automatically converted to \c std::basic_string object.
78
79	To perform boundary analysis, we first create an index object and then iterate over it:
80
81	For example:
82
83	\code
84	using namespace boost::locale::boundary;
85	boost::locale::generator gen;
86	std::string text="To be or not to be, that is the question."
87	// Create mapping of text for token iterator using global locale.
88	ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
89	// Print all "words" -- chunks of word boundary
90	for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
91	std::cout <<"\""<< * it << "\", ";
92	std::cout << std::endl;
93	\endcode
94
95	Would print:
96
97	\verbatim
98	"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",
99	\endverbatim
100
101	This sentence "生きるか死ぬか、それが問題だ。" (<a href="http://tatoeba.org/eng/sentences/show/868189">from Tatoeba database</a>)
102	would be split into following segments in \c ja_JP.UTF-8 (Japanese) locale:
103
104	\verbatim
105	"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。",
106	\endverbatim
107
108	The boundary analysis that is done by Boost.Locale
109	is much more complicated then just splitting the text according
110	to white space characters, even thou it is not perfect.
111
112
113	\section boundary_analysys_segments_rules Using Rules
114
115	The segments selection can be customized using \ref boost::locale::boundary::segment_index::rule(rule_type) "rule()" and
116	\ref boost::locale::boundary::segment_index::full_select(bool) "full_select()" member functions.
117
118	By default segment_index's iterator return each text segment defined by two boundary points regardless
119	the way they were selected. Thus in the example above we could see text segments like "." or " "
120	that were selected as words.
121
122	Using a \c rule() member function we can specify a binary mask of rules we want to use for selection of
123	the boundary points using \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
124	and \ref bl_boundary_sentence_rules "sentence" boundary rules.
125
126	For example, by calling
127
128	\code
129	map.rule(word_any);
130	\endcode
131
132	Before starting the iteration process, specify a selection mask that fetches: numbers, letter, Kana letters and
133	ideographic characters ignoring all non-word related characters like white space or punctuation marks.
134
135	So the code:
136
137	\code
138	using namespace boost::locale::boundary;
139	std::string text="To be or not to be, that is the question."
140	// Create mapping of text for token iterator using global locale.
141	ssegment_index map(word,text.begin(),text.end());
142	// Define a rule
143	map.rule(word_any);
144	// Print all "words" -- chunks of word boundary
145	for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
146	std::cout <<"\""<< * it << "\", ";
147	std::cout << std::endl;
148	\endcode
149
150	Would print:
151
152	\verbatim
153	"To", "be", "or", "not", "to", "be", "that", "is", "the", "question",
154	\endverbatim
155
156	And the for given text="生きるか死ぬか、それが問題だ。" and rule(\ref boost::locale::boundary::word_ideo "word_ideo"), the example above would print.
157
158	\verbatim
159	"生", "死", "問題",
160	\endverbatim
161
162	You can access specific rules the segments where selected it using \ref boost::locale::boundary::segment::rule() "segment::rule()" member
163	function. Using a bit-mask of rules.
164
165	For example:
166
167	\code
168	boost::locale::generator gen;
169	using namespace boost::locale::boundary;
170	std::string text="生きるか死ぬか、それが問題だ。";
171	ssegment_index map(word,text.begin(),text.end(),gen("ja_JP.UTF-8"));
172	for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) {
173	std::cout << "Segment " << *it << " contains: ";
174	if(it->rule() & word_none)
175	std::cout << "white space or punctuation marks ";
176	if(it->rule() & word_kana)
177	std::cout << "kana characters ";
178	if(it->rule() & word_ideo)
179	std::cout << "ideographic characters";
180	std::cout<< std::endl;
181	}
182	\endcode
183
184	Would print
185
186	\verbatim
187	Segment 生 contains: ideographic characters
188	Segment きるか contains: kana characters
189	Segment 死 contains: ideographic characters
190	Segment ぬか contains: kana characters
191	Segment 、 contains: white space or punctuation marks
192	Segment それが contains: kana characters
193	Segment 問題 contains: ideographic characters
194	Segment だ contains: kana characters
195	Segment 。 contains: white space or punctuation marks
196	\endverbatim
197
198	One important things that should be noted that each segment is defined
199	by a pair of boundaries and the rule of its ending point defines
200	if it is selected or not.
201
202	In some cases it may be not what we actually look like.
203
204	For example we have a text:
205
206	\verbatim
207	Hello! How
208	are you?
209	\endverbatim
210
211	And we want to fetch all sentences from the text.
212
213	The \ref bl_boundary_sentence_rules "sentence rules" have two options:
214
215	- Split the text on the point where sentence terminator like ".!?" detected: \ref boost::locale::boundary::sentence_term "sentence_term"
216	- Split the text on the point where sentence separator like "line feed" detected: \ref boost::locale::boundary::sentence_sep "sentence_sep"
217
218	Naturally to ignore sentence separators we would call \ref boost::locale::boundary::segment_index::rule(rule_type v) "segment_index::rule(rule_type v)"
219	with sentence_term parameter and then run the iterator.
220
221	\code
222	boost::locale::generator gen;
223	using namespace boost::locale::boundary;
224	std::string text= "Hello! How\n"
225	"are you?\n";
226	ssegment_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));
227	map.rule(sentence_term);
228	for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
229	std::cout << "Sentence [" << *it << "]" << std::endl;
230	\endcode
231
232	However we would get the expected segments:
233	\verbatim
234	Sentence [Hello! ]
235	Sentence [are you?
236	]
237	\endverbatim
238
239	The reason is that "How\n" is still considered a sentence but selected by different
240	rule.
241
242	This behavior can be changed by setting \ref boost::locale::boundary::segment_index::full_select(bool) "segment_index::full_select(bool)"
243	to \c true. It would force iterator to join the current segment with all previous segments that may not fit the required rule.
244
245	So we add this line:
246
247	\code
248	map.full_select(true);
249	\endcode
250
251	Right after "map.rule(sentence_term);" and get expected output:
252
253	\verbatim
254	Sentence [Hello! ]
255	Sentence [How
256	are you?
257	]
258	\endverbatim
259
260	\subsection boundary_analysys_segments_search Locating Segments
261
262	Sometimes it is useful to find a segment that some specific iterator is pointing on.
263
264	For example a user had clicked at specific point, we want to select a word on this
265	location.
266
267	\ref boost::locale::boundary::segment_index "segment_index" provides
268	\ref boost::locale::boundary::segment_index::find() "find(base_iterator p)"
269	member function for this purpose.
270
271	This function returns the iterator to the segmet such that \a p points to.
272
273
274	For example:
275
276	\code
277	text="to be or ";
278	ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
279	ssegment_index::iterator p = map.find(text.begin() + 4);
280	if(p!=map.end())
281	std::cout << *p << std::endl;
282	\endcode
283
284	Would print:
285
286	\verbatim
287	be
288	\endverbatim
289
290	\note
291
292	if the iterator lays inside the segment this segment returned. If the segment does
293	not fit the selection rules, then the segment following requested position
294	is returned.
295
296	For example: For \ref boost::locale::boundary::word "word" boundary analysis with \ref boost::locale::boundary::word_any "word_any" rule:
297
298	- "t\|o be or ", would point to "to" - the iterator in the middle of segment "to".
299	- "to \|be or ", would point to "be" - the iterator at the beginning of the segment "be"
300	- "to\| be or ", would point to "be" - the iterator does is not point to segment with required rule so next valid segment is selected "be".
301	- "to be or\| ", would point to end as not valid segment found.
302
303
304	\section boundary_analysys_break Iterating Over Boundary Points
305	\section boundary_analysys_break_basics Basic Iteration
306
307	The \ref boost::locale::boundary::boundary_point_index "boundary_point_index" is similar to
308	\ref boost::locale::boundary::segment_index "segment_index" in its interface but as a different role.
309	Instead of returning text chunks (\ref boost::locale::boundary::segment "segment"s, it returns
310	\ref boost::locale::boundary::boundary_point "boundary_point" object that
311	represents a position in text - a base iterator used that is used for
312	iteration of the source text C++ characters.
313	The \ref boost::locale::boundary::boundary_point "boundary_point" object
314	also provides a \ref boost::locale::boundary::boundary_point::rule() "rule()" member
315	function that defines a rule this boundary was selected according to.
316
317	\note The beginning and the ending of the text are considered boundary points, so even
318	an empty text consists of at least one boundary point.
319
320	Lets see an example of selecting first two sentences from a text:
321
322	\code
323	using namespace boost::locale::boundary;
324	boost::locale::generator gen;
325
326	// our text sample
327	std::string const text="First sentence. Second sentence! Third one?";
328	// Create an index
329	sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));
330
331	// Count two boundary points
332	sboundary_point_index::iterator p = map.begin(),e=map.end();
333	int count = 0;
334	while(p!=e && count < 2) {
335	++count;
336	++p;
337	}
338
339	if(p!=e) {
340	std::cout << "First two sentences are: "
341	<< std::string(text.begin(),p->iterator())
342	<< std::endl;
343	}
344	else {
345	std::cout <<"There are less then two sentences in this "
346	<<"text: " << text << std::endl;
347	}\endcode
348
349	Would print:
350
351	\verbatim
352	First two sentences are: First sentence. Second sentence!
353	\endverbatim
354
355	\section boundary_analysys_break_rules Using Rules
356
357	Similarly to the \ref boost::locale::boundary::segment_index "segment_index" the
358	\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
359	a \ref boost::locale::boundary::boundary_point_index::rule(rule_type r) "rule(rule_type mask)"
360	member function to filter boundary points that interest us.
361
362	It allows to set \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
363	and \ref bl_boundary_sentence_rules "sentence" rules for filtering boundary points.
364
365	Lets change an example above a little:
366
367	\code
368	// our text sample
369	std::string const text= "First sentence. Second\n"
370	"sentence! Third one?";
371	\endcode
372
373	If we run our program as is on the sample above we would get:
374	\verbatim
375	First two sentences are: First sentence. Second
376	\endverbatim
377
378	Which is not something that we really expected. As the "Second\n"
379	is considered an independent sentence that was separated by
380	a line separator "Line Feed".
381
382	However, we can set set a rule \ref boost::locale::boundary::sentence_term "sentence_term"
383	and the iterator would use only boundary points that are created
384	by a sentence terminators like ".!?".
385
386	So by adding:
387	\code
388	map.rule(sentence_term);
389	\endcode
390
391	Right after the generation of the index we would get the desired output:
392
393	\verbatim
394	First two sentences are: First sentence. Second
395	sentence!
396	\endverbatim
397
398	You can also use \ref boost::locale::boundary::boundary_point::rule() "boundary_point::rule()" member
399	function to learn about the reason this boundary point was created by comparing it with an appropriate
400	mask.
401
402	For example:
403
404	\code
405	using namespace boost::locale::boundary;
406	boost::locale::generator gen;
407	// our text sample
408	std::string const text= "First sentence. Second\n"
409	"sentence! Third one?";
410	sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));
411
412	for(sboundary_point_index::iterator p = map.begin(),e=map.end();p!=e;++p) {
413	if(p->rule() & sentence_term)
414	std::cout << "There is a sentence terminator: ";
415	else if(p->rule() & sentence_sep)
416	std::cout << "There is a sentence separator: ";
417	if(p->rule()!=0) // print if some rule exists
418	std::cout << "[" << std::string(text.begin(),p->iterator())
419	<< "\|" << std::string(p->iterator(),text.end())
420	<< "]\n";
421	}
422	\endcode
423
424	Would give the following output:
425	\verbatim
426	There is a sentence terminator: [First sentence. \|Second
427	sentence! Third one?]
428	There is a sentence separator: [First sentence. Second
429	\|sentence! Third one?]
430	There is a sentence terminator: [First sentence. Second
431	sentence! \|Third one?]
432	There is a sentence terminator: [First sentence. Second
433	sentence! Third one?\|]
434	\endverbatim
435
436	\subsection boundary_analysys_break_search Locating Boundary Points
437
438	Sometimes it is useful to find a specific boundary point according to given
439	iterator.
440
441	\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
442	a \ref boost::locale::boundary::boundary_point_index::find() "iterator find(base_iterator p)" member
443	function.
444
445	It would return an iterator to a boundary point on \a p's location or at the
446	location following it if \a p does not point to appropriate position.
447
448	For example, for word boundary analysis:
449
450	- If a base iterator points to "to \|be", then the returned boundary point would be "to \|be" (same position)
451	- If a base iterator points to "t\|o be", then the returned boundary point would be "to\| be" (next valid position)
452
453	For example if we want to select 6 words around specific boundary point we can use following code:
454
455	\code
456	using namespace boost::locale::boundary;
457	boost::locale::generator gen;
458	// our text sample
459	std::string const text= "To be or not to be, that is the question.";
460
461	// Create a mapping
462	sboundary_point_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
463	// Ignore wite space
464	map.rule(word_any);
465
466	// define our arbitraty point
467	std::string::const_iterator pos = text.begin() + 12; // "no\|t";
468
469	// Get the search range
470	sboundary_point_index::iterator
471	begin =map.begin(),
472	end = map.end(),
473	it = map.find(pos); // find a boundary
474
475	// go 3 words backward
476	for(int count = 0;count <3 && it!=begin; count ++)
477	--it;
478
479	// Save the start
480	std::string::const_iterator start = *it;
481
482	// go 6 words forward
483	for(int count = 0;count < 6 && it!=end; count ++)
484	++it;
485
486	// make sure we at valid position
487	if(it==end)
488	--it;
489
490	// print the text
491	std::cout << std::string(start,it->iterator()) << std::endl;
492	\endcode
493
494	That would print:
495
496	\verbatim
497	be or not to be, that
498	\endverbatim
499
500
501	*/
502
503