Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | http://www.linuxfromscratch.org/blfs/view/svn/introduction/locale-issues.html |
2 | ||
3 | "The POSIX standard mandates that the filename encoding is the encoding implied by the current LC_CTYPE locale category." | |
4 | ||
5 | ------- | |
6 | ||
7 | http://mail.nl.linux.org/linux-utf8/2001-02/msg00103.html | |
8 | ||
9 | From: Markus Kuhn | |
10 | ||
11 | Tom Tromey wrote on 2001-02-05 00:36 UTC: | |
12 | > Kai> IMAO, a *real* filesystem should use some encoding of ISO 10646 - | |
13 | > Kai> UTF-8, UTF-16, or UTF-32 are all viable options. The same should | |
14 | > Kai> be true for the kernel filename interfaces. | |
15 | > | |
16 | > I like this, but what should I do right now? | |
17 | ||
18 | The POSIX kernel file system interface is engraved into stone and | |
19 | extremely unlikely to change. File names are arbitrary binary strings, | |
20 | with only the '/' and '\0' bytes having any special semantics. You can | |
21 | use arbitrary coded character sets on it as long as they do not | |
22 | introduce '/' and '\0' bytes spuriously. Writers and readers have to | |
23 | somehow agree on what encoding to use and the only really practical way | |
24 | is to use the same encoding on all systems that share files. Eventually, | |
25 | everyone will be using UTF-8 for file names on POSIX systems. Right now, | |
26 | I would recommend users to use only ASCII for filenames, as this is | |
27 | already UTF-8 and therefore simplifies migration. Using the ISO 8859, | |
28 | JIS, etc. filenames should soon be considered deprecated practice. | |
29 | ||
30 | > I work on libgcj, the runtime component of gcj, the Java front end to | |
31 | > GCC. In libgcj of course we use UCS-2 everywhere, since that is what | |
32 | > Java does. Currently, for Unixy systems, we assume that all file | |
33 | > names are UTF-8. | |
34 | ||
35 | The best solution is to assume that the file names are in the | |
36 | locale-specific multi-byte encoding. Simply use mbrtowc and wcrtomb to | |
37 | convert between Unicode and the locale-dependent multi-byte encoding | |
38 | used in file names and text files if the ISO C 99 symbol | |
39 | __STDC_ISO_10646__ is defined (which guarantees that wchar_t = UCS). On | |
40 | Linux, this has been the case since glibc 2.2. | |
41 | ||
42 | > (Actually, we do something notably worse, which is | |
43 | > assume that file names are Java-style UTF-8, with the weird encoding | |
44 | > for \u0000.) | |
45 | ||
46 | \u0000 = NUL was never a character allowed in filenames under POSIX. | |
47 | Raise an exception if someone tries to use it in a filename. Problem | |
48 | solved. | |
49 | ||
50 | I never understood, why Java found it necessary to introduce two | |
51 | distinct ASCII NUL characters. | |
52 | ||
53 | ------ | |
54 | ||
55 | Interesting idea: use iconv to create Shift-JIS or other multi-byte character set (MBCS) test cases.
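
The closing note suggests generating Shift-JIS or other multi-byte file names as test cases. The following is a minimal sketch of that idea, not a definitive implementation. It uses Python's built-in `shift_jis` codec in place of the iconv tool, and the helper names (`encode_filename`, `create_test_files`) and the sample names are hypothetical. It also applies the rule quoted above that only the `/` and `\0` bytes are special in POSIX file names, and it assumes a file system that accepts arbitrary byte strings as names, as is typical on Linux.

```python
import os

# Hypothetical sample names; any text the target encoding can represent works.
SAMPLE_NAMES = ["日本語.txt", "表.txt"]


def encode_filename(name, encoding="shift_jis"):
    """Encode a name to raw file-name bytes in the given multi-byte encoding.

    Rejects results containing the two bytes POSIX treats specially in a
    single path component: '/' (0x2F) and NUL (0x00).
    """
    raw = name.encode(encoding)
    if b"\0" in raw or b"/" in raw:
        raise ValueError("file name contains a '/' or NUL byte")
    return raw


def create_test_files(directory, names, encoding="shift_jis"):
    """Create empty files whose names are encoded in `encoding`.

    `directory` must be a bytes path so the encoded names are passed to the
    operating system unchanged. Returns the list of raw names created.
    """
    created = []
    for name in names:
        raw = encode_filename(name, encoding)
        with open(os.path.join(directory, raw), "wb"):
            pass  # an empty file is enough for a file-name test case
        created.append(raw)
    return created


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        for raw in create_test_files(os.fsencode(tmp), SAMPLE_NAMES):
            print(raw)
```

The name `表.txt` is a useful case because its Shift-JIS encoding ends the first character with the byte 0x5C, the ASCII backslash, which is harmless on POSIX systems but exercises code that wrongly scans file names byte by byte for separators.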