http://www.linuxfromscratch.org/blfs/view/svn/introduction/locale-issues.html

"The POSIX standard mandates that the filename encoding is the encoding implied by the current LC_CTYPE locale category." 

-------

http://mail.nl.linux.org/linux-utf8/2001-02/msg00103.html

From: Markus Kuhn

Tom Tromey wrote on 2001-02-05 00:36 UTC:
> Kai> IMAO, a *real* filesystem should use some encoding of ISO 10646 -
> Kai> UTF-8, UTF-16, or UTF-32 are all viable options. The same should
> Kai> be true for the kernel filename interfaces.
> 
> I like this, but what should I do right now?

The POSIX kernel file system interface is engraved into stone and
extremely unlikely to change. File names are arbitrary binary strings,
with only the '/' and '\0' bytes having any special semantics. You can
use arbitrary coded character sets on it as long as they do not
introduce '/' and '\0' bytes spuriously. Writers and readers have to
somehow agree on what encoding to use and the only really practical way
is to use the same encoding on all systems that share files. Eventually,
everyone will be using UTF-8 for file names on POSIX systems. Right now,
I would recommend that users use only ASCII for filenames, as this is
already valid UTF-8 and therefore simplifies migration. Using ISO 8859,
JIS, etc. in filenames should soon be considered deprecated practice.
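
[Editorial sketch, not part of the quoted message.] The rule that only the
'/' and '\0' bytes are special can be checked without knowing the encoding
at all. The helper below is hypothetical (its name and signature are
assumptions made for illustration) and deliberately ignores length limits
such as NAME_MAX and the special names "." and "..".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the len bytes at name could serve
   as a single POSIX path component, whatever encoding they are in. */
static bool is_valid_component(const unsigned char *name, size_t len)
{
    assert(name != NULL);
    if (len == 0)
        return false;               /* an empty component names nothing */
    for (size_t i = 0; i < len; ++i) {
        if (name[i] == '/' || name[i] == '\0')
            return false;           /* the only two bytes with special meaning */
    }
    return true;
}
```

Because the check is byte-wise, a Latin-1 name and a UTF-8 name both pass;
it is the readers and writers who must agree on how to interpret the bytes.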

> I work on libgcj, the runtime component of gcj, the Java front end to
> GCC.  In libgcj of course we use UCS-2 everywhere, since that is what
> Java does.  Currently, for Unixy systems, we assume that all file
> names are UTF-8.

The best solution is to assume that the file names are in the
locale-specific multi-byte encoding. Simply use mbrtowc and wcrtomb to
convert between Unicode and the locale-dependent multi-byte encoding
used in file names and text files if the ISO C99 macro
__STDC_ISO_10646__ is defined (which guarantees that wchar_t = UCS). On
Linux, this has been the case since glibc 2.2.
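
[Editorial sketch, not part of the quoted message.] One way the mbrtowc
direction might look is shown below. The function name, signature, and
error convention are assumptions made for illustration. The caller is
assumed to have called setlocale(LC_CTYPE, "") beforehand, and the result
is only Unicode when __STDC_ISO_10646__ is defined.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the NUL-terminated multibyte string src,
   interpreted in the current LC_CTYPE locale, into dst (capacity cap wide
   characters, including the terminator). Returns the number of wide
   characters written, excluding the terminator, or (size_t)-1 if src is
   not valid in the locale's encoding or dst is too small. */
static size_t filename_to_wide(const char *src, wchar_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return (size_t)-1;
    mbstate_t st;
    memset(&st, 0, sizeof st);           /* start in the initial shift state */
    size_t left = strlen(src);
    size_t n = 0;
    while (left > 0) {
        if (n + 1 >= cap)
            return (size_t)-1;           /* no room for char plus terminator */
        size_t used = mbrtowc(&dst[n], src, left, &st);
        if (used == 0 || used == (size_t)-1 || used == (size_t)-2)
            return (size_t)-1;           /* invalid or truncated sequence */
        src += used;
        left -= used;
        ++n;
    }
    dst[n] = L'\0';
    return n;
}
```

Returning an error for undecodable names is a design choice of this sketch;
since file names are arbitrary byte strings, a real program also needs a
policy for names that are not valid in the current locale.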

> (Actually, we do something notably worse, which is
> assume that file names are Java-style UTF-8, with the weird encoding
> for \u0000.)

\u0000 = NUL was never a character allowed in filenames under POSIX.
Raise an exception if someone tries to use it in a filename. Problem
solved.
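
[Editorial sketch, not part of the quoted message.] C has no exceptions, so
the closest equivalent is to fail the conversion. The hypothetical helper
below (name and signature are assumptions) encodes wide characters with
wcrtomb and refuses any input containing U+0000. It assumes a stateless
locale encoding and does not emit a final shift sequence.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: encode the first len wide characters of src into
   the current LC_CTYPE locale's multibyte encoding. Returns the number of
   bytes written to dst, excluding the terminating NUL, or (size_t)-1 if
   src contains U+0000, contains a character the locale cannot represent,
   or needs more than cap bytes including the terminator. */
static size_t wide_to_filename(const wchar_t *src, size_t len,
                               char *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return (size_t)-1;
    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t out = 0;                      /* invariant: out < cap */
    for (size_t i = 0; i < len; ++i) {
        if (src[i] == L'\0')
            return (size_t)-1;           /* NUL can never appear in a name */
        char tmp[MB_LEN_MAX];
        size_t n = wcrtomb(tmp, src[i], &st);
        if (n == (size_t)-1)
            return (size_t)-1;           /* not representable in this locale */
        if (n >= cap - out)
            return (size_t)-1;           /* would leave no room for the NUL */
        memcpy(dst + out, tmp, n);
        out += n;
    }
    dst[out] = '\0';
    return out;
}
```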

I never understood why Java found it necessary to introduce two
distinct ASCII NUL characters.

------

Interesting idea. Use iconv to create Shift-JIS or other multibyte
character set (MBCS) test cases.
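
A minimal sketch of how such test names might be generated with the POSIX
iconv API follows. The helper name and signature are assumptions made for
illustration, and whether an encoding name such as "SHIFT_JIS" is accepted
depends on the iconv implementation installed.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert the len bytes of UTF-8 at src into the
   encoding named by to_code (for example "SHIFT_JIS"), writing at most
   cap bytes to dst. Returns the number of bytes written, or (size_t)-1
   if the conversion is unavailable or fails. */
static size_t make_test_name(const char *to_code, const char *src, size_t len,
                             char *dst, size_t cap)
{
    assert(to_code != NULL && src != NULL && dst != NULL);
    iconv_t cd = iconv_open(to_code, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;               /* this iconv lacks the encoding */
    char *in = (char *)src;              /* iconv takes char **, input is not modified */
    char *out = dst;
    size_t in_left = len, out_left = cap;
    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &out, &out_left);  /* flush any shift state */
    iconv_close(cd);
    if (rc == (size_t)-1 || in_left != 0)
        return (size_t)-1;
    return cap - out_left;
}
```

The returned bytes are not NUL-terminated; a test would append the
terminator before passing the name to a file system call.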