//
//  Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt)
//

// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen
/*!

\page rationale Design Rationale

- \ref rationale_why
- \ref why_icu
- \ref why_icu_wrapper
- \ref why_icu_api_is_hidden
- \ref why_gnu_gettext
- \ref why_plain_number
- \ref why_posix_names
- \ref why_linear_chunks
- \ref why_abstract_api
- \ref why_no_special_character_type

\section rationale_why Why is it needed?

Why do we need a localization library when the standard C++ facets (should) provide most of the required functionality?

- Case conversion is done using the \c std::ctype facet.
- Collation is supported by \c std::collate and has nice integration with \c std::locale.
- There are \c std::num_put , \c std::num_get , \c std::money_put , \c std::money_get , \c std::time_put and \c std::time_get for number,
    time, and currency formatting and parsing.
- There is a \c std::messages class that supports localized message formatting.

So why do we need such a library if all this functionality is already in the standard library?

Almost every(!) facet has design flaws:

-  \c std::collate supports only one level of collation, so you cannot choose whether comparisons should be case- or
    accent-sensitive.

-  \c std::ctype, which is responsible for case conversion, assumes that all conversions can be done on a per-character basis. This is
    probably correct for many languages but it isn't correct in general.
    \n
    -# Case conversion may change a string's length. For example, the German word "grüßen" should be converted to "GRÜSSEN" in upper
    case: the letter "ß" should be converted to "SS", but the \c toupper function works on a single-character basis.
    -# Case conversion is context-sensitive. For example, the Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the Greek letter
    "Σ" is converted to "σ" or to "ς", depending on its position in the word.
    -# Case conversion cannot assume that a single character (\c char or \c wchar_t) is a complete code point. This assumption is
       incorrect for both the UTF-8 and UTF-16 encodings, where an individual code point is represented by up to four \c char 's,
       or by up to two \c wchar_t 's on the Windows platform. This makes \c std::ctype totally useless with these encodings.
-   \c std::numpunct and \c std::moneypunct do not specify the code points for digit representation at all,
    so they cannot format numbers with the digits used under Arabic locales. For example,
    the number "103" is expected to be displayed as "١٠٣" in the \c ar_EG locale.
    \n
    \c std::numpunct and \c std::moneypunct assume that the thousands separator is a single character. This does not hold
    for the UTF-8 encoding, where only the Unicode range 0-0x7F can be represented as a single \c char. As a result, localized numbers can't be
    represented correctly under locales that use the Unicode "EN SPACE" character for the thousands separator, such as Russian.
    \n
    This actually causes real problems with the GCC and SunStudio compilers, where formatting numbers under a Russian locale creates invalid
    UTF-8 sequences.
-   \c std::time_put and \c std::time_get have several flaws:
    -# They assume that the calendar is always Gregorian, by using \c std::tm for time representation, ignoring the fact that in many
       countries dates may be displayed using different calendars.
    -# They always use a global time zone and do not allow the time zone to be specified for formatting. The standard \c std::tm
       doesn't include a time zone field at all.
    -# \c std::time_get is not symmetric with \c std::time_put, so you cannot parse dates and times created with \c std::time_put .
       (This issue is addressed in C++0x and in some STL implementations, such as the Apache standard C++ library.)
-   \c std::messages does not provide support for plural forms, making it impossible to correctly localize such simple strings as
       "There are X files in the directory".

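The length problem described for \c std::ctype can be observed with the standard library alone. The following is a minimal
sketch, not part of Boost.Locale: the helper name \c per_char_upper is hypothetical, and it assumes the text is UTF-8 stored
in a \c std::string. Because the \c toupper member of \c std::ctype maps one \c char to exactly one \c char, the result always
has the same length as the input, so "ß" can never become "SS".

```cpp
#include <locale>
#include <string>

// Hypothetical helper: upper-case a string the only way std::ctype allows,
// that is, one char at a time.
std::string per_char_upper(const std::string& text, const std::locale& loc)
{
    const std::ctype<char>& facet = std::use_facet<std::ctype<char> >(loc);
    std::string result = text;
    for (std::string::size_type i = 0; i < result.size(); ++i) {
        // One char in, one char out: the length of the string cannot change.
        result[i] = facet.toupper(result[i]);
    }
    return result;
}
```
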
Also, many features are not really supported by \c std::locale at all: timezones (as mentioned above), text boundary analysis, number
spelling, and many others. So it is clear that the standard C++ locales are problematic for real-world applications.

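The single-character limit on the thousands separator is visible in the facet interface itself: \c std::numpunct returns
the separator from \c do_thousands_sep() as one \c char_type. The sketch below is only an illustration (the names
\c space_separated and \c format_grouped are hypothetical); it installs an ASCII space as the separator. A non-ASCII
separator needs at least two bytes in UTF-8, which a single \c char return value cannot express.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: groups digits by three and separates groups with a space.
class space_separated : public std::numpunct<char> {
protected:
    // The interface allows exactly one char here, never a multi-byte sequence.
    virtual char do_thousands_sep() const { return ' '; }
    virtual std::string do_grouping() const { return "\3"; }
};

// Format a number through a stream that uses the facet above.
std::string format_grouped(long value)
{
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new space_separated));
    out << value;
    return out.str();
}
```

With this facet, \c format_grouped(1234567) yields "1 234 567".
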
\section why_icu Why use an ICU wrapper instead of ICU?

ICU is a very good localization library, but it has several serious flaws:

- It is absolutely unfriendly to C++ developers. It ignores popular C++ idioms (the STL, RTTI, exceptions, etc), instead
mostly mimicking the Java API.
- It provides support for only one kind of string, UTF-16, when some users may want other Unicode encodings.
For example, for XML or HTML processing UTF-8 is much more convenient and UTF-32 easier to use. Also there is no support for
"narrow" encodings that are still very popular, such as the ISO-8859 encodings.

In contrast, Boost.Locale provides direct integration with \c iostream, allowing a more natural way of formatting data. For example:

\code
    cout << "You have " << as::currency << 134.45 << " in your account as of " << as::datetime << std::time(0) << endl;
\endcode

\section why_icu_wrapper Why an ICU wrapper and not an implementation-from-scratch?

ICU is one of the best localization/Unicode libraries available. It consists of about half a million lines of well-tested,
production-proven source code that today provides state-of-the-art localization tools.

Reimplementing even a small part of ICU's abilities is an infeasible project which would require many man-years. So the
question is not whether we need to reimplement the Unicode and localization algorithms from scratch, but "Do we need a good
localization library in Boost?"

Thus Boost.Locale wraps ICU with a modern C++ interface. This allows parts to be replaced with better alternatives in the future,
while bringing localization support to Boost today and not in the not-so-near-if-at-all future.


\section why_icu_api_is_hidden Why is the ICU API not exposed to the user?

The entire ICU API is hidden behind opaque pointers and users have no access to it. This is done for several reasons:

- At some point, better localization tools may be accepted by future C++ standards, and these may not use ICU directly.
- At some point, it should be possible to switch the underlying localization engine to something else, maybe the native operating
system API or some other toolkit such as GLib or Qt that provides similar functionality.
- Not all localization is done within ICU. For example, message formatting uses GNU Gettext message catalogs. In the future more
functionality may be reimplemented directly in the Boost.Locale library.
- Boost.Locale was designed with ABI stability in mind, as this library is being developed not only for Boost but also
for the needs of the <a href="http://cppcms.sourceforge.net/">CppCMS C++ Web framework</a>.


\section why_gnu_gettext Why use GNU Gettext catalogs for message formatting?

There are many available localization formats. The most popular so far are OASIS XLIFF, GNU gettext po/mo files, POSIX catalogs, Qt ts/qm files, Java properties, and Windows resources. However, the last three are useful only in their specific areas, and POSIX catalogs are too simple and limited, so there are only two reasonable options:

-# The standard localization format, OASIS XLIFF.
-# GNU Gettext binary catalogs.

The first one generally seems like a more correct localization solution, but it requires XML parsing for loading documents,
it is a very complicated format, and even ICU requires preliminary compilation of it into ICU resource bundles.

On the other hand:

- GNU Gettext binary catalogs have a very simple, robust and yet very useful file format.
- It is at present the most popular and de-facto standard localization format (at least in the Open Source world).
- It has very simple and powerful support for plural forms.
- It uses the original English text as the key, making the process of internationalization much easier because at least
one basic translation is always available.
- There are many tools for editing and managing gettext catalogs, such as Poedit, KBabel, etc.

So, even though the GNU Gettext mo catalog format is not an officially approved file format:

- It is a de-facto standard and the most popular one.
- Its implementation is much easier and does not require XML parsing and validation.


\note Boost.Locale does not use any of the GNU Gettext code. It just reimplements the tool for reading and using mo-files,
eliminating the biggest GNU Gettext flaw at present -- the lack of thread safety when using multiple locales.

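The plural-form support mentioned in this section boils down to a per-language rule that maps a count to the index of a
translated string. The sketch below is not Boost.Locale code; the function names are hypothetical, and the two rules are
the ones commonly listed for English and Russian in the GNU gettext documentation.

```cpp
#include <string>

// English: two forms. Index 0 for exactly one item, index 1 for everything else.
int english_plural_index(unsigned long n)
{
    return n != 1 ? 1 : 0;
}

// Russian: three forms, chosen by the last one or two decimal digits.
int russian_plural_index(unsigned long n)
{
    if (n % 10 == 1 && n % 100 != 11)
        return 0;
    if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20))
        return 1;
    return 2;
}

// Pick the translated form; `forms` must hold one string per index the rule can return.
std::string select_form(const std::string forms[], int index)
{
    return forms[index];
}
```

A catalog would store one translation per index, and the rule selects among them at run time; for instance the Russian
rule returns 0 for 21, 1 for 22 and 2 for 11.
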
\section why_plain_number Why is a plain number used for the representation of a date-time, instead of a Boost.DateTime date or Boost.DateTime ptime?

There are several reasons:

-#  A Gregorian Date by definition can't be used to represent locale-independent dates, because not all
    calendars are Gregorian.
-#  \c ptime could definitely be used, but it has several problems:
    \n
    -   It is created from either the GMT or the local time clock, whereas \c time() gives a representation that is independent
        of time zones (usually GMT time); only later should it be represented in the time zone that the user requests.
        \n
        The time zone is not a property of the time itself, but rather a property of time formatting.
        \n
    -   \c ptime already defines \c operator<< and \c operator>> for time formatting and parsing.
    -   The existing facets for \c ptime formatting and parsing were not designed in a way that the user can override.
        The major formatting and parsing functions are not virtual. This makes it impossible to reimplement the formatting and
        parsing functions of \c ptime unless the developers of the Boost.DateTime library decide to change them.
        \n
        Also, the facets of \c ptime are not "correctly" designed in terms of division of formatting information and
        locale information. Formatting information should be stored within \c std::ios_base and information about
        locale-specific formatting should be stored in the facet itself.
        \n
        The user of the library should not have to create new facets to change simple formatting information like "display only
        the date" or "display both date and time."

Thus, at this point, \c ptime is not supported for formatting localized dates and times.

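The idea that a time zone belongs to formatting rather than to the time value can be illustrated with the C library alone.
In this sketch (the function name \c format_time is hypothetical, and this is not how Boost.Locale is implemented), the same
plain \c std::time_t is rendered either in UTC or in the process-wide local zone; only the formatting step differs. It
assumes, as on POSIX systems, that \c std::time_t counts seconds since 1970-01-01 00:00:00 UTC.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper: format one time point either as UTC or as local time.
// The time_t value itself carries no time zone.
std::string format_time(std::time_t when, bool use_utc)
{
    // gmtime/localtime return a pointer to shared static storage, so copy the result.
    const std::tm* parts = use_utc ? std::gmtime(&when) : std::localtime(&when);
    if (!parts)
        return std::string();
    std::tm copy = *parts;

    char buffer[32];
    std::size_t length = std::strftime(buffer, sizeof buffer, "%Y-%m-%d %H:%M:%S", &copy);
    return std::string(buffer, length);
}
```

Note that the local branch can only use the single global time zone of the process, which is exactly the limitation of
\c std::tm based formatting described in \ref rationale_why.
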
\section why_posix_names Why are POSIX locale names used and not something like the BCP-47 IETF language tag?

There are several reasons:

- POSIX locale names have a very important feature: character encoding. When you specify, for example, fr-FR, you
do not actually know how the text should be encoded -- UTF-8, ISO-8859-1, ISO-8859-15 or maybe Windows-1252.
This may vary between different operating systems and depends on the current installation. So it is critical
to provide all the required information.
- ICU fully understands POSIX locales and knows how to treat them correctly.
- They are native locale names for most operating system APIs (with the exception of Windows).

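A POSIX locale name consists of a language, optionally followed by an underscore and a country, a dot and an encoding, and
an at-sign and a variant. The sketch below splits such a name into its parts to show where the encoding lives; the struct
and function names are hypothetical, and this is not the parser Boost.Locale uses.

```cpp
#include <string>

// Hypothetical holder for the parts of a name such as "fr_FR.ISO-8859-15@euro".
struct posix_locale_name {
    std::string language;
    std::string country;
    std::string encoding;
    std::string variant;
};

// Split from the right: first the variant, then the encoding, then the country.
posix_locale_name parse_posix_locale_name(const std::string& name)
{
    posix_locale_name parts;
    std::string rest = name;

    std::string::size_type at = rest.find('@');
    if (at != std::string::npos) {
        parts.variant = rest.substr(at + 1);
        rest.erase(at);
    }
    std::string::size_type dot = rest.find('.');
    if (dot != std::string::npos) {
        parts.encoding = rest.substr(dot + 1);
        rest.erase(dot);
    }
    std::string::size_type underscore = rest.find('_');
    if (underscore != std::string::npos) {
        parts.country = rest.substr(underscore + 1);
        rest.erase(underscore);
    }
    parts.language = rest;
    return parts;
}
```

For a name without an encoding part the \c encoding field stays empty, which is the situation a bare language tag always leaves you in.
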
\section why_linear_chunks Why do most parts of Boost.Locale work only on linear/contiguous chunks of text?

There are two reasons:

- Boost.Locale relies heavily on third-party APIs such as ICU, POSIX or the Win32 API. All of them
  work only on linear chunks of text, so providing a non-linear API would just hide the
  real situation and would not bring any real performance advantage.
- In fact, all known libraries that work with Unicode (ICU, Qt, GLib, the Win32 API, the POSIX API
  and others) accept input as a single linear chunk of text, and there are good reasons for this:
  \n
  -#  Most supported text operations, such as collation and case handling, usually work on small
      chunks of text. For example, you would probably never want to compare two chapters of a book, but rather
      their titles.
  -#  We should remember that even very large texts require quite a small amount of memory. For example,
      the entire book "War and Peace" takes only about 3MB of memory.
   \n

However:

-  There are APIs that support stream processing. For example, character set conversion using the
\c std::codecvt API works on streams of any size without problems.
-  When a new API that is likely to work on large chunks of text is introduced into Boost.Locale in the future,
   it will provide an interface for non-linear text handling.


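The stream-friendly nature of \c std::codecvt comes from its conversion state: input can be fed in pieces of any size, and
an incomplete character at the end of one piece is completed by the next. The following is a minimal sketch of that usage
pattern, assuming the standard \c wchar_t / \c char facet of the given locale; the class name \c chunk_decoder is hypothetical
and not part of Boost.Locale.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decodes narrow input that arrives in arbitrary chunks.
class chunk_decoder {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;

public:
    explicit chunk_decoder(const std::locale& loc)
        : loc_(loc), facet_(std::use_facet<facet_type>(loc_)), state_(), pending_(), text_() {}

    // Convert one chunk; bytes of an unfinished character are kept for the next call.
    void feed(const std::string& chunk)
    {
        pending_ += chunk;
        if (pending_.empty())
            return;

        // One wide character per input byte is always enough room.
        std::vector<wchar_t> buffer(pending_.size());
        const char* from = pending_.data();
        const char* from_next = from;
        wchar_t* to = &buffer[0];
        wchar_t* to_next = to;

        std::codecvt_base::result r = facet_.in(state_, from, from + pending_.size(), from_next,
                                                to, to + buffer.size(), to_next);
        if (r == std::codecvt_base::error)
            throw std::runtime_error("invalid byte sequence");

        text_.append(to, to_next);
        pending_.erase(0, static_cast<std::string::size_type>(from_next - from));
    }

    const std::wstring& text() const { return text_; }

private:
    std::locale loc_;          // keeps the facet alive
    const facet_type& facet_;
    std::mbstate_t state_;     // conversion state carried across chunks
    std::string pending_;      // bytes not yet consumed by the facet
    std::wstring text_;        // decoded output so far
};
```

Because only the small \c pending_ buffer and the conversion state survive between calls, the total size of the stream does not matter.
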
\section why_abstract_api Why is all of the Boost.Locale implementation hidden behind abstract interfaces, and why does it not use template metaprogramming?

There are several major reasons:

- This is how C++'s \c std::locale class is built. Each feature is represented using a subclass of
  \c std::locale::facet that provides an abstract API for the specific operations it works on; see \ref std_locales.
- This approach allows switching the underlying API without changing the actual application code, even at run time, depending
  on performance and localization requirements.
- This approach reduces compilation times significantly. This is very important for a library that may be
  used in almost every part of a given program.

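The facet pattern described above can be sketched with the standard library alone. In the example below all class and
function names are hypothetical and unrelated to the real Boost.Locale facets: an abstract facet declares the operation,
two implementations provide it, and the calling code depends only on the abstract interface, so the implementation can be
chosen at run time by installing a different facet into the locale.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical abstract facet: callers see only this interface.
class greeting_facet : public std::locale::facet {
public:
    static std::locale::id id;
    explicit greeting_facet(std::size_t refs = 0) : std::locale::facet(refs) {}
    virtual std::string greet() const = 0;
};

std::locale::id greeting_facet::id;

// Two interchangeable implementations behind the same interface.
class english_greeting : public greeting_facet {
public:
    virtual std::string greet() const { return "Hello"; }
};

class french_greeting : public greeting_facet {
public:
    virtual std::string greet() const { return "Bonjour"; }
};

// Application code: unchanged no matter which implementation the locale carries.
std::string greet(const std::locale& loc)
{
    return std::use_facet<greeting_facet>(loc).greet();
}
```

A locale built as \c std::locale(std::locale::classic(), new french_greeting) makes \c greet return "Bonjour" without
recompiling the caller, which mirrors the run-time switching mentioned in the second point.
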
\section why_no_special_character_type Why does Boost.Locale not provide char16_t/char32_t for non-C++0x platforms?

There are several reasons:

- C++0x defines \c char16_t and \c char32_t as distinct types, so substituting them with something like \c uint16_t or \c uint32_t
  would not work. For example, writing a \c uint16_t to a \c uint32_t stream would write a number to the stream.
- The C++ locale system works only if standard facets like \c std::num_put are installed into the
  existing instance of \c std::locale. However, in many standard C++ libraries these facets are specialized for each
  specific character type that the standard library supports, so an attempt to create a facet for a new character type would
  fail because it is not specialized.

These are exactly the reasons why Boost.Locale fails with the current limited C++0x character support in GCC-4.5 (the second reason)
and MSVC-2010 (the first reason).

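The first reason can be reproduced with an ordinary \c char stream standing in for the hypothetical \c uint32_t stream.
In this sketch (the function names are illustrative, and \c std::uint16_t from the C++11 header \c cstdint is used for
brevity), the integer type is formatted as a number even when it holds a character code, while a real character type is
written as a character.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// An integer typedef is still an integer to the stream: it is formatted as digits.
std::string stream_uint16(std::uint16_t value)
{
    std::ostringstream out;
    out << value;
    return out.str();
}

// A genuine character type is written as the character itself.
std::string stream_char(char value)
{
    std::ostringstream out;
    out << value;
    return out.str();
}
```
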
So basically it is impossible to use non-C++ character types with the C++ locale framework.

The best and most portable solution is to use C++'s \c char type and the UTF-8 encoding.

*/