http://www.linuxfromscratch.org/blfs/view/svn/introduction/locale-issues.html

"The POSIX standard mandates that the filename encoding is the encoding implied by the current LC_CTYPE locale category."

-------

http://mail.nl.linux.org/linux-utf8/2001-02/msg00103.html

From: Markus Kuhn

Tom Tromey wrote on 2001-02-05 00:36 UTC:
> Kai> IMAO, a *real* filesystem should use some encoding of ISO 10646 -
> Kai> UTF-8, UTF-16, or UTF-32 are all viable options. The same should
> Kai> be true for the kernel filename interfaces.
>
> I like this, but what should I do right now?

The POSIX kernel file system interface is engraved into stone and
extremely unlikely to change. File names are arbitrary binary strings,
with only the '/' and '\0' bytes having any special semantics. You can
use arbitrary coded character sets on it as long as they do not
introduce '/' and '\0' bytes spuriously. Writers and readers have to
somehow agree on what encoding to use and the only really practical way
is to use the same encoding on all systems that share files. Eventually,
everyone will be using UTF-8 for file names on POSIX systems. Right now,
I would recommend users to use only ASCII for filenames, as this is
already UTF-8 and therefore simplifies migration. Using the ISO 8859,
JIS, etc. filenames should soon be considered deprecated practice.

> I work on libgcj, the runtime component of gcj, the Java front end to
> GCC. In libgcj of course we use UCS-2 everywhere, since that is what
> Java does. Currently, for Unixy systems, we assume that all file
> names are UTF-8.

The best solution is to assume that the file names are in the
locale-specific multi-byte encoding. Simply use mbrtowc and wcrtomb to
convert between Unicode and the locale-dependent multi-byte encoding
used in file names and text files if the ISO C 99 symbol
__STDC_ISO_10646__ is defined (which guarantees that wchar_t = UCS). On
Linux, this has been the case since glibc 2.2.

> (Actually, we do something notably worse, which is
> assume that file names are Java-style UTF-8, with the weird encoding
> for \u0000.)

\u0000 = NUL was never a character allowed in filenames under POSIX.
Raise an exception if someone tries to use it in a filename. Problem
solved.

I never understood, why Java found it necessary to introduce two
distinct ASCII NUL characters.
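The approach in the message above can be sketched in C roughly as follows.
This is only an illustration, not code from the message or from any library:
the names is_valid_component, wchar_is_ucs, filename_to_wide and
wide_to_filename are hypothetical. It assumes the caller has already run
setlocale(LC_CTYPE, "") so that mbrtowc and wcrtomb follow the user's locale,
and it ignores the final shift sequence that a stateful encoding would need.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: a file name component is any non-empty byte
 * string that contains neither '/' nor '\0'. Returns 1 if valid. */
int is_valid_component(const char *bytes, size_t len)
{
    size_t i;
    if (len == 0)
        return 0;
    for (i = 0; i < len; ++i)
        if (bytes[i] == '/' || bytes[i] == '\0')
            return 0;
    return 1;
}

/* Returns 1 if the implementation promises that wchar_t holds
 * ISO 10646 (UCS) code points, 0 otherwise. */
int wchar_is_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: decode a NUL-terminated file name from the
 * locale's multi-byte encoding. Returns the number of wide characters
 * written (not counting the terminator), or (size_t)-1 on an invalid
 * sequence or if out is too small. */
size_t filename_to_wide(const char *name, wchar_t *out, size_t out_cap)
{
    mbstate_t st;
    size_t len = strlen(name);
    size_t n = 0;

    memset(&st, 0, sizeof st);
    while (len > 0) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, name, len, &st);
        if (used == (size_t)-1 || used == (size_t)-2)
            return (size_t)-1;      /* invalid or truncated sequence */
        if (used == 0)
            break;                  /* decoded a NUL; stop here */
        if (n + 1 >= out_cap)
            return (size_t)-1;      /* keep room for the terminator */
        out[n++] = wc;
        name += used;
        len -= used;
    }
    if (n >= out_cap)
        return (size_t)-1;
    out[n] = L'\0';
    return n;
}

/* Hypothetical helper: encode a wide string into the locale's
 * multi-byte encoding. Returns the number of bytes written (not
 * counting the terminator), or (size_t)-1 on failure. */
size_t wide_to_filename(const wchar_t *ws, char *out, size_t out_cap)
{
    mbstate_t st;
    size_t n = 0;

    memset(&st, 0, sizeof st);
    for (; *ws != L'\0'; ++ws) {
        char buf[MB_LEN_MAX];
        size_t used = wcrtomb(buf, *ws, &st);
        if (used == (size_t)-1)
            return (size_t)-1;      /* not representable in this locale */
        if (n + used >= out_cap)
            return (size_t)-1;
        memcpy(out + n, buf, used);
        n += used;
    }
    if (n >= out_cap)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

A caller would check wchar_is_ucs() before treating the wchar_t values as
Unicode code points, and would reject any name for which
is_valid_component() returns 0 rather than trying to encode a NUL.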

------

Interesting idea. Use iconv to create shift-jis or other mbcs test cases.
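
One possible way to generate such test names is the POSIX iconv interface. The
sketch below is an assumption about how that might look, not an existing test:
convert_name is a hypothetical helper, and whether a charset name such as
"SHIFT_JIS" is accepted depends on the iconv implementation installed.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert utf8_len bytes of UTF-8 into the charset
 * named by to_charset. Returns the number of bytes written to out, or
 * (size_t)-1 if the charset is unavailable or the conversion fails. */
size_t convert_name(const char *to_charset, const char *utf8,
                    size_t utf8_len, char *out, size_t out_cap)
{
    iconv_t cd = iconv_open(to_charset, "UTF-8");
    char *in = (char *)utf8;    /* iconv takes a non-const char ** */
    char *dst = out;
    size_t in_left = utf8_len;
    size_t out_left = out_cap;
    size_t rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &in, &in_left, &dst, &out_left);
    if (rc != (size_t)-1)
        /* Flush any pending shift sequence for stateful encodings. */
        rc = iconv(cd, NULL, NULL, &dst, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - out_left;
}
```

A useful input is KATAKANA LETTER SO (U+30BD, UTF-8 bytes E3 82 BD): its
Shift-JIS encoding is 0x83 0x5C, so the second byte is the same as an ASCII
backslash. That makes it a good check that path parsing does not treat trail
bytes of a multi-byte character as separators.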