// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen
// Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
// Distributed under the Boost Software License, Version 1.0. (See
// accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)

\page boundary_analysys Boundary analysis

- \ref boundary_analysys_basics
- \ref boundary_analysys_segments
    - \ref boundary_analysys_segments_basics
    - \ref boundary_analysys_segments_rules
    - \ref boundary_analysys_segments_search
- \ref boundary_analysys_break
    - \ref boundary_analysys_break_basics
    - \ref boundary_analysys_break_rules
    - \ref boundary_analysys_break_search

\section boundary_analysys_basics Basics

Boost.Locale provides a boundary analysis tool that lets you split text into characters,
words, or sentences, and find appropriate places for line breaks.

\note This is not a trivial task.

- A Unicode code point and a character are not equivalent. For example, the
  Hebrew word Shalom, "שָלוֹם", consists of 4 characters and 6 code points (4 base letters and 2 diacritical marks).
- In some languages, such as Japanese or Chinese, words may not be separated by space characters.
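
The difference between code points and characters can be checked by hand. The following is a
minimal sketch that does not use Boost.Locale. The helper names \c count_code_points and
\c count_base_letters are hypothetical, and treating the whole range U+0591..U+05C7 as
combining marks is a simplification that is only good enough for this one word. Real character
boundaries should come from the boundary analysis described on this page.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string: every byte that is not a
// continuation byte (10xxxxxx) starts a new code point.
std::size_t count_code_points(std::string const &utf8)
{
    std::size_t n = 0;
    for(std::size_t i = 0; i < utf8.size(); ++i)
        if((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
            ++n;
    return n;
}

// Rough "character" count for this example only: decode each code point
// and skip the Hebrew points and accents range U+0591..U+05C7.
// Assumption: everything in that range is a combining mark.
std::size_t count_base_letters(std::string const &utf8)
{
    std::size_t n = 0;
    for(std::size_t i = 0; i < utf8.size();) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        unsigned long cp = c;
        std::size_t len = 1;
        if(c >= 0xF0)      { cp = c & 0x07; len = 4; }
        else if(c >= 0xE0) { cp = c & 0x0F; len = 3; }
        else if(c >= 0xC0) { cp = c & 0x1F; len = 2; }
        for(std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        if(cp < 0x0591 || cp > 0x05C7)
            ++n;
        i += len;
    }
    return n;
}

// "Shalom" with vowel points, spelled out as UTF-8 bytes:
// U+05E9 U+05B8 U+05DC U+05D5 U+05B9 U+05DD
std::string const shalom = "\xD7\xA9\xD6\xB8\xD7\x9C\xD7\x95\xD6\xB9\xD7\x9D";
```

For this word, <tt>shalom.size()</tt> is 12 bytes, <tt>count_code_points(shalom)</tt> gives 6
and <tt>count_base_letters(shalom)</tt> gives 4.
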

Boost.Locale provides 2 major classes for boundary analysis:

- \ref boost::locale::boundary::segment_index - an object that holds an index of segments in the text (like words, characters,
  sentences). It provides access to \ref boost::locale::boundary::segment "segment" objects via iterators.
- \ref boost::locale::boundary::boundary_point_index - an object that holds an index of boundary points in the text.
  It allows you to iterate over the \ref boost::locale::boundary::boundary_point "boundary_point" objects.

Each of the classes above uses an iterator type as a template parameter.
Both of these classes accept in their constructor:

- A flag that defines the type of boundary analysis: \ref boost::locale::boundary::boundary_type "boundary_type".
- A pair of iterators that define the text range to be analysed.
- A locale parameter (if not given, the global one is used).

For example:

\code
namespace ba=boost::locale::boundary;
std::string text= ... ;
std::locale loc = ... ;
ba::segment_index<std::string::const_iterator> map(ba::word,text.begin(),text.end(),loc);
\endcode

Each of them provides the member functions \c begin(), \c end() and \c find() that allow you to iterate
over the selected segments or boundaries in the text, or to find the location of a segment or
boundary for a given iterator.

Convenience typedefs like \ref boost::locale::boundary::ssegment_index "ssegment_index"
or \ref boost::locale::boundary::wcboundary_point_index "wcboundary_point_index" are provided as well,
where the "w", "u16" and "u32" prefixes define the character types \c wchar_t,
\c char16_t and \c char32_t, and the "s" and "c" prefixes define whether <tt>std::basic_string<CharType>::const_iterator</tt>
or <tt>CharType const *</tt> is used.

\section boundary_analysys_segments Iterating Over Segments
\subsection boundary_analysys_segments_basics Basic Iteration

Text segment analysis is done using the \ref boost::locale::boundary::segment_index "segment_index" class.

It provides a bidirectional iterator that returns \ref boost::locale::boundary::segment "segment" objects.
A segment object represents a pair of iterators that define the segment and the rule according to which it was selected.
It can be automatically converted to a \c std::basic_string object.

To perform boundary analysis, we first create an index object and then iterate over it.

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
std::string text="To be or not to be, that is the question.";
// Create mapping of text for token iterator using the en_US.UTF-8 locale.
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Print all "words" -- chunks of word boundary
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",
\endverbatim

The sentence "生きるか死ぬか、それが問題だ。" (<a href="http://tatoeba.org/eng/sentences/show/868189">from the Tatoeba database</a>)
would be split into the following segments in the \c ja_JP.UTF-8 (Japanese) locale:

\verbatim
"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。",
\endverbatim

The boundary analysis done by Boost.Locale
is much more complicated than just splitting the text according
to white space characters, even though it is not perfect.

\subsection boundary_analysys_segments_rules Using Rules

Segment selection can be customized using the \ref boost::locale::boundary::segment_index::rule(rule_type) "rule()" and
\ref boost::locale::boundary::segment_index::full_select(bool) "full_select()" member functions.

By default, segment_index's iterator returns each text segment defined by two boundary points regardless
of the way they were selected. Thus in the example above we could see text segments like "." or " "
that were selected as words.

Using the \c rule() member function we can specify a binary mask of rules we want to use for selection of
the boundary points using \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" boundary rules.

For example, by calling

\code
map.rule(word_any);
\endcode

before starting the iteration process, we specify a selection mask that fetches numbers, letters, Kana letters and
ideographic characters, ignoring all non-word related characters like white space or punctuation marks.

So the code:

\code
using namespace boost::locale::boundary;
std::string text="To be or not to be, that is the question.";
// Create mapping of text for token iterator using global locale.
ssegment_index map(word,text.begin(),text.end());
// Define a rule
map.rule(word_any);
// Print all "words" -- chunks of word boundary
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", "be", "or", "not", "to", "be", "that", "is", "the", "question",
\endverbatim
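
To make the idea of a binary mask concrete, here is a small model of the selection step that
does not use Boost.Locale. Everything in it is an assumption made for illustration: the
\c demo_ constants are stand-ins with made-up values, and \c demo_segment and
\c select_segments are hypothetical names. It only shows that a segment is kept when its rule
shares at least one bit with the mask.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative stand-ins, not the library's definitions:
// one group of bits per kind of word segment.
typedef unsigned rule_type;
rule_type const demo_none   = 0x0000F; // white space, punctuation
rule_type const demo_number = 0x000F0;
rule_type const demo_letter = 0x00F00;
rule_type const demo_kana   = 0x0F000;
rule_type const demo_ideo   = 0xF0000;
rule_type const demo_any    = demo_number | demo_letter | demo_kana | demo_ideo;

struct demo_segment {
    std::string text;
    rule_type rule;
};

// Keep only the segments whose rule has at least one bit in common with mask.
std::vector<std::string> select_segments(std::vector<demo_segment> const &segments,
                                         rule_type mask)
{
    std::vector<std::string> result;
    for(std::size_t i = 0; i < segments.size(); ++i)
        if(segments[i].rule & mask)
            result.push_back(segments[i].text);
    return result;
}

// Hand-made segmentation of "to be, 42" used as sample input.
std::vector<demo_segment> sample_segments()
{
    std::vector<demo_segment> s = {
        {"to", demo_letter}, {" ", demo_none}, {"be", demo_letter},
        {",", demo_none},    {" ", demo_none}, {"42", demo_number}
    };
    return s;
}
```

With the mask \c demo_any the model keeps "to", "be" and "42". With \c demo_letter alone it
keeps only "to" and "be".
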

And for the given text="生きるか死ぬか、それが問題だ。" and rule(\ref boost::locale::boundary::word_ideo "word_ideo"), the example above would print:

\verbatim
"生", "死", "問題",
\endverbatim

You can find out which rules a segment was selected by using the \ref boost::locale::boundary::segment::rule() "segment::rule()" member
function and testing its result against a bit-mask of rules.

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text="生きるか死ぬか、それが問題だ。";
ssegment_index map(word,text.begin(),text.end(),gen("ja_JP.UTF-8"));
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) {
    std::cout << "Segment " << *it << " contains: ";
    if(it->rule() & word_none)
        std::cout << "white space or punctuation marks ";
    if(it->rule() & word_kana)
        std::cout << "kana characters ";
    if(it->rule() & word_ideo)
        std::cout << "ideographic characters";
    std::cout<< std::endl;
}
\endcode

Would print:

\verbatim
Segment 生 contains: ideographic characters
Segment きるか contains: kana characters
Segment 死 contains: ideographic characters
Segment ぬか contains: kana characters
Segment 、 contains: white space or punctuation marks
Segment それが contains: kana characters
Segment 問題 contains: ideographic characters
Segment だ contains: kana characters
Segment 。 contains: white space or punctuation marks
\endverbatim

One important thing to note is that each segment is defined
by a pair of boundaries, and the rule of its ending point determines
whether it is selected or not.

In some cases this may not be what we actually want.

For example, we have the text:

\verbatim
Hello! How
are you?
\endverbatim

And we want to fetch all sentences from the text.

The \ref bl_boundary_sentence_rules "sentence rules" have two options:

- Split the text at the point where a sentence terminator like ".!?" is detected: \ref boost::locale::boundary::sentence_term "sentence_term"
- Split the text at the point where a sentence separator like "line feed" is detected: \ref boost::locale::boundary::sentence_sep "sentence_sep"

Naturally, to ignore sentence separators we would call \ref boost::locale::boundary::segment_index::rule(rule_type v) "segment_index::rule(rule_type v)"
with the sentence_term parameter and then run the iterator.

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text= "Hello! How\n"
                  "are you?";
ssegment_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));
map.rule(sentence_term);
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout << "Sentence [" << *it << "]" << std::endl;
\endcode

However, we would not get the expected segments:

\verbatim
Sentence [Hello! ]
Sentence [are you?]
\endverbatim

The reason is that "How\n" is still considered a sentence but selected by a different
rule.

This behavior can be changed by setting \ref boost::locale::boundary::segment_index::full_select(bool) "segment_index::full_select(bool)"
to \c true. It forces the iterator to join the current segment with all previous segments that may not fit the required rule.

So we add this line:

\code
map.full_select(true);
\endcode

right after "map.rule(sentence_term);" and get the expected output:

\verbatim
Sentence [Hello! ]
Sentence [How
are you?]
\endverbatim
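
The effect of \c full_select can be pictured with a small model that does not use
Boost.Locale. It is only a sketch of the behavior described above, not the library's
implementation: \c demo_piece, \c select_pieces and the \c demo_ constants are hypothetical,
and the sample is a hand-made split of a two-line text into three pieces, each tagged with the
rule of the boundary point that ends it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef unsigned rule_type;
rule_type const demo_term = 0x0F; // stand-in for a sentence terminator rule
rule_type const demo_sep  = 0xF0; // stand-in for a sentence separator rule

struct demo_piece {
    std::string text; // text between two neighbouring boundary points
    rule_type rule;   // rule of the boundary point that ends the piece
};

// Select the pieces whose ending rule fits the mask. With full_select set,
// each selected piece also absorbs the rejected pieces right before it.
std::vector<std::string> select_pieces(std::vector<demo_piece> const &pieces,
                                       rule_type mask, bool full_select)
{
    std::vector<std::string> result;
    std::string pending; // rejected pieces waiting to be joined
    for(std::size_t i = 0; i < pieces.size(); ++i) {
        if(pieces[i].rule & mask) {
            result.push_back(full_select ? pending + pieces[i].text : pieces[i].text);
            pending.clear();
        }
        else {
            pending += pieces[i].text;
        }
    }
    return result;
}

// Hand-made split of "Hello! How\nare you?" used as sample input.
std::vector<demo_piece> hello_pieces()
{
    std::vector<demo_piece> p = {
        {"Hello! ", demo_term}, {"How\n", demo_sep}, {"are you?", demo_term}
    };
    return p;
}
```

With the mask \c demo_term and \c full_select off, the model yields "Hello! " and "are you?".
With \c full_select on, the second result becomes "How\nare you?".
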

\subsection boundary_analysys_segments_search Locating Segments

Sometimes it is useful to find the segment that a specific iterator is pointing to.

For example, a user has clicked at a specific point and we want to select the word at this
position.

\ref boost::locale::boundary::segment_index "segment_index" provides the
\ref boost::locale::boundary::segment_index::find() "find(base_iterator p)"
member function for this purpose.

This function returns an iterator to the segment that \a p points to.

For example:

\code
std::string text="to be or ";
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
ssegment_index::iterator p = map.find(text.begin() + 4);
if(p!=map.end())
    std::cout << *p << std::endl;
\endcode

Would print:

\verbatim
be
\endverbatim

\note

If the iterator lies inside a segment, this segment is returned. If the segment does
not fit the selection rules, then the segment following the requested position
is returned.

For example: For \ref boost::locale::boundary::word "word" boundary analysis with \ref boost::locale::boundary::word_any "word_any" rule:

- "t|o be or ", would point to "to" - the iterator is in the middle of the segment "to".
- "to |be or ", would point to "be" - the iterator is at the beginning of the segment "be".
- "to| be or ", would point to "be" - the iterator does not point to a segment with the required rule, so the next valid segment, "be", is selected.
- "to be or| ", would point to the end, as no valid segment is found.
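
The four cases above can be reproduced with a small model of the lookup that does not use
Boost.Locale. This is a sketch of the described behavior only: \c demo_range,
\c find_segment and the \c demo_ constants are hypothetical, positions are plain offsets
instead of iterators, and the segmentation of "to be or " is written out by hand.

```cpp
#include <cstddef>
#include <vector>

typedef unsigned rule_type;
rule_type const demo_none = 0x0F; // white space
rule_type const demo_word = 0xF0; // stand-in for a "word" rule

struct demo_range {
    std::size_t begin; // offset of the first character of the segment
    std::size_t end;   // offset one past its last character
    rule_type rule;
};

// Return the index of the segment selected for pos, or ranges.size()
// when no valid segment is found (the "end" case).
std::size_t find_segment(std::vector<demo_range> const &ranges,
                         std::size_t pos, rule_type mask)
{
    std::size_t i = 0;
    // Skip the segments that end at or before pos.
    while(i < ranges.size() && ranges[i].end <= pos)
        ++i;
    // Skip the segments that do not fit the selection mask.
    while(i < ranges.size() && !(ranges[i].rule & mask))
        ++i;
    return i;
}

// Hand-made segmentation of "to be or ":
// index 0 "to", 1 " ", 2 "be", 3 " ", 4 "or", 5 " "
std::vector<demo_range> to_be_or()
{
    std::vector<demo_range> r = {
        {0, 2, demo_word}, {2, 3, demo_none}, {3, 5, demo_word},
        {5, 6, demo_none}, {6, 8, demo_word}, {8, 9, demo_none}
    };
    return r;
}
```

With the mask \c demo_word, offsets 1, 3, 2 and 8 give the indexes 0 ("to"), 2 ("be"),
2 ("be") and 6 (end), matching the list above.
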

\section boundary_analysys_break Iterating Over Boundary Points
\subsection boundary_analysys_break_basics Basic Iteration

The \ref boost::locale::boundary::boundary_point_index "boundary_point_index" is similar to
\ref boost::locale::boundary::segment_index "segment_index" in its interface but has a different role.
Instead of returning text chunks (\ref boost::locale::boundary::segment "segment"s), it returns
\ref boost::locale::boundary::boundary_point "boundary_point" objects that
represent a position in the text - a base iterator that is used for
iterating over the C++ characters of the source text.
The \ref boost::locale::boundary::boundary_point "boundary_point" object
also provides a \ref boost::locale::boundary::boundary_point::rule() "rule()" member
function that returns the rule according to which this boundary was selected.

\note The beginning and the end of the text are considered boundary points, so even
an empty text contains at least one boundary point.

Let's see an example of selecting the first two sentences from a text:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;

// our text sample
std::string const text="First sentence. Second sentence! Third one?";
// Create an index
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

// Count two boundary points
sboundary_point_index::iterator p = map.begin(),e=map.end();
int count = 0;
while(p!=e && count < 2) {
    ++count;
    ++p;
}

if(p!=e) {
    std::cout   << "First two sentences are: "
                << std::string(text.begin(),p->iterator())
                << std::endl;
}
else {
    std::cout   <<"There are less than two sentences in this "
                <<"text: " << text << std::endl;
}
\endcode

Would print:

\verbatim
First two sentences are: First sentence. Second sentence!
\endverbatim

\subsection boundary_analysys_break_rules Using Rules

Similarly to the \ref boost::locale::boundary::segment_index "segment_index", the
\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
a \ref boost::locale::boundary::boundary_point_index::rule(rule_type r) "rule(rule_type mask)"
member function to filter the boundary points that interest us.

It allows you to set \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" rules for filtering boundary points.

Let's change the example above a little:

\code
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
\endcode

If we run our program as is on the sample above we would get:

\verbatim
First two sentences are: First sentence. Second
\endverbatim

This is not what we really expected, because "Second\n"
is considered an independent sentence that was separated by
the line separator "Line Feed".

However, we can set the rule \ref boost::locale::boundary::sentence_term "sentence_term"
and the iterator will use only boundary points that are created
by sentence terminators like ".!?".

So by adding:

\code
map.rule(sentence_term);
\endcode

right after the generation of the index we would get the desired output:

\verbatim
First two sentences are: First sentence. Second
sentence!
\endverbatim

You can also use the \ref boost::locale::boundary::boundary_point::rule() "boundary_point::rule()" member
function to learn why a boundary point was created by comparing it with an appropriate
mask.

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

for(sboundary_point_index::iterator p = map.begin(),e=map.end();p!=e;++p) {
    if(p->rule() & sentence_term)
        std::cout << "There is a sentence terminator: ";
    else if(p->rule() & sentence_sep)
        std::cout << "There is a sentence separator: ";
    if(p->rule()!=0) // print if some rule exists
        std::cout   << "[" << std::string(text.begin(),p->iterator())
                    << "|" << std::string(p->iterator(),text.end())
                    << "]\n";
}
\endcode

Would give the following output:

\verbatim
There is a sentence terminator: [First sentence. |Second
sentence! Third one?]
There is a sentence separator: [First sentence. Second
|sentence! Third one?]
There is a sentence terminator: [First sentence. Second
sentence! |Third one?]
There is a sentence terminator: [First sentence. Second
sentence! Third one?|]
\endverbatim

\subsection boundary_analysys_break_search Locating Boundary Points

Sometimes it is useful to find a specific boundary point according to a given
iterator.

\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
an \ref boost::locale::boundary::boundary_point_index::find() "iterator find(base_iterator p)" member
function for this purpose.

It returns an iterator to a boundary point at \a p's location, or at the
location following it if \a p does not point to an appropriate position.

For example, for word boundary analysis:

- If a base iterator points to "to |be", then the returned boundary point would be "to |be" (same position)
- If a base iterator points to "t|o be", then the returned boundary point would be "to| be" (next valid position)
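
The "same position or next valid position" behavior is the familiar lower-bound search. The
sketch below models it without Boost.Locale, under stated assumptions: boundary points are
plain sorted offsets instead of iterators, no rule mask is applied, and \c find_boundary is a
hypothetical helper, not the library's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the index of the first boundary point that is not before pos,
// or boundaries.size() when there is none. The offsets must be sorted.
std::size_t find_boundary(std::vector<std::size_t> const &boundaries, std::size_t pos)
{
    std::vector<std::size_t>::const_iterator it =
        std::lower_bound(boundaries.begin(), boundaries.end(), pos);
    return static_cast<std::size_t>(it - boundaries.begin());
}

// Hand-made word boundary offsets for "to be": |to| |be|
std::vector<std::size_t> to_be_boundaries()
{
    std::vector<std::size_t> b = {0, 2, 3, 5};
    return b;
}
```

For offset 3 ("to |be") the model returns index 2, the boundary at the same offset. For
offset 1 ("t|o be") it returns index 1, the boundary at offset 2 ("to| be").
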

For example, if we want to select 6 words around a specific boundary point, we can use the following code:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "To be or not to be, that is the question.";

// Create a mapping
sboundary_point_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Ignore white space
map.rule(word_any);

// define our arbitrary point
std::string::const_iterator pos = text.begin() + 12; // "not| to"

// Get the search range
sboundary_point_index::iterator
    begin = map.begin(),
    end = map.end(),
    it = map.find(pos); // find a boundary

// go 3 words backward
for(int count = 0;count <3 && it!=begin; count ++)
    --it;

// Save the start
std::string::const_iterator start = it->iterator();

// go 6 words forward
for(int count = 0;count < 6 && it!=end; count ++)
    ++it;

// make sure we are at a valid position
if(it==end)
    --it;

// print the text
std::cout << std::string(start,it->iterator()) << std::endl;
\endcode

That would print:

\verbatim
be or not to be, that
\endverbatim