Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | // |
2 | // Copyright (c) 2009-2011 Artyom Beilis (Tonkikh) | |
3 | // | |
4 | // Distributed under the Boost Software License, Version 1.0. (See | |
5 | // accompanying file LICENSE_1_0.txt or copy at | |
6 | // http://www.boost.org/LICENSE_1_0.txt) | |
7 | // | |
8 | ||
9 | // vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen | |
10 | /*! | |
11 | \page conversions Text Conversions | |
12 | ||
13 | There is a set of functions that perform basic string conversion operations: | |
14 | upper, lower and \ref term_title_case "title case" conversions, \ref term_case_folding "case folding" | |
15 | and Unicode \ref term_normalization "normalization". These are \ref boost::locale::to_upper "to_upper" , \ref boost::locale::to_lower "to_lower", \ref boost::locale::to_title "to_title", \ref boost::locale::fold_case "fold_case" and \ref boost::locale::normalize "normalize". | |
16 | ||
17 | All these functions accept a \c std::locale object as a parameter, or use the global locale by default. | |
18 | ||
19 | The global locale is used in all examples below. | |
20 | ||
21 | \section conversions_case Case Handling | |
22 | ||
23 | For example: | |
24 | \code | |
25 | std::string grussen = "grüßEN"; | |
26 | std::cout <<"Upper "<< boost::locale::to_upper(grussen) << std::endl | |
27 | <<"Lower "<< boost::locale::to_lower(grussen) << std::endl | |
28 | <<"Title "<< boost::locale::to_title(grussen) << std::endl | |
29 | <<"Fold "<< boost::locale::fold_case(grussen) << std::endl; | |
30 | \endcode | |
31 | ||
32 | This would print: | |
33 | ||
34 | \verbatim | |
35 | Upper GRÜSSEN | |
36 | Lower grüßen | |
37 | Title Grüßen | |
38 | Fold grüssen | |
39 | \endverbatim | |
40 | ||
41 | You may notice that the Boost.StringAlgo library already provides \c to_upper and \c to_lower functions. | |
42 | The difference is that the Boost.Locale functions operate on the entire string, instead of converting it character by character, which can produce incorrect results. | |
43 | ||
44 | For example: | |
45 | ||
46 | \code | |
47 | std::wstring grussen = L"grüßen"; | |
48 | std::wcout << boost::algorithm::to_upper_copy(grussen) << " " << boost::locale::to_upper(grussen) << std::endl; | |
49 | \endcode | |
50 | ||
51 | This would output: | |
52 | ||
53 | \verbatim | |
54 | GRÜßEN GRÜSSEN | |
55 | \endverbatim | |
56 | ||
57 | Here the letter "ß" was not converted correctly to "SS" in the first case because of a limitation of the \c std::ctype facet, which maps each character to exactly one character. | |
58 | ||
59 | This is even more problematic with UTF-8 encoding, where non-ASCII characters are not converted at all. | |
60 | For example, this code | |
61 | ||
62 | \code | |
63 | std::string grussen = "grüßen"; | |
64 | std::cout << boost::algorithm::to_upper_copy(grussen) << " " << boost::locale::to_upper(grussen) << std::endl; | |
65 | \endcode | |
66 | ||
67 | would modify ASCII characters only: | |
68 | ||
69 | \verbatim | |
70 | GRüßEN GRÜSSEN | |
71 | \endverbatim | |
72 | ||
73 | \section conversions_normalization Unicode Normalization | |
74 | ||
75 | Unicode normalization is the process of converting strings to a standard form, suitable for text processing and | |
76 | comparison. For example, the character "ü" can be represented by a single code point or by the character "u" followed by a combining | |
77 | diaeresis "¨". Normalization is an important part of Unicode text processing. | |
78 | ||
79 | Unicode defines four normalization forms. Each specific form is selected by a flag passed | |
80 | to the \ref boost::locale::normalize() "normalize" function: | |
81 | ||
82 | - NFD - Canonical decomposition - boost::locale::norm_nfd | |
83 | - NFC - Canonical decomposition followed by canonical composition - boost::locale::norm_nfc or boost::locale::norm_default | |
84 | - NFKD - Compatibility decomposition - boost::locale::norm_nfkd | |
85 | - NFKC - Compatibility decomposition followed by canonical composition - boost::locale::norm_nfkc | |
86 | ||
87 | For more details on normalization forms, read <a href="http://unicode.org/reports/tr15/#Norm_Forms">this article</a>. | |
88 | ||
89 | \section conversions_notes Notes | |
90 | ||
91 | - \ref boost::locale::normalize() "normalize" operates only on Unicode-encoded strings, i.e. UTF-8, UTF-16 or UTF-32 depending on the | |
92 | character width. Be careful when using non-UTF encodings, as they may be handled incorrectly. | |
93 | - \ref boost::locale::fold_case() "fold_case" is generally a locale-independent operation, but it receives a locale as a parameter to | |
94 | determine the 8-bit encoding. | |
95 | - All of these functions can work with an STL string, a NUL-terminated string, or a range defined by two pointers. They always | |
96 | return a newly created STL string. | |
97 | - The length of the string may change, as the conversion of "ß" to "SS" in the example above shows. | |
98 | */ | |
99 | ||
100 |