========================
Differences from POSIX
========================

CephFS aims to adhere to POSIX semantics wherever possible. For
example, in contrast to many other common network file systems like
NFS, CephFS maintains strong cache coherency across clients. The goal
is for processes communicating via the file system to behave the same
when they are on different hosts as when they are on the same host.

However, there are a few places where CephFS diverges from strict
POSIX semantics for various reasons:

- If a client is writing to a file and fails, its writes are not
  necessarily atomic. That is, the client may call write(2) on a file
  opened with O_SYNC with an 8 MB buffer and then crash, and the write
  may be only partially applied. (Almost all file systems, even local
  file systems, have this behavior.)
- In shared simultaneous writer situations, a write that crosses
  object boundaries is not necessarily atomic. This means that you
  could have writer A write "aa|aa" and writer B write "bb|bb"
  simultaneously (where | is the object boundary), and end up with
  "aa|bb" rather than the proper "aa|aa" or "bb|bb".
- Sparse files propagate incorrectly to the stat(2) st_blocks field.
  Because CephFS does not explicitly track which parts of a file are
  allocated/written, the st_blocks field is always populated by the
  file size divided by the block size. This will cause tools like
  du(1) to overestimate consumed space. (The recursive size field,
  maintained by CephFS, also includes file "holes" in its count.)
- When a file is mapped into memory via mmap(2) on multiple hosts,
  writes are not coherently propagated to other clients' caches. That
  is, if a page is cached on host A, and then updated on host B, host
  A's page is not coherently invalidated. (Shared writable mmap
  appears to be quite rare--we have yet to hear any complaints about
  this behavior, and implementing cache coherency properly is
  complex.)
- CephFS clients present a hidden ``.snap`` directory that is used to
  access, create, delete, and rename snapshots. Although the virtual
  directory is excluded from readdir(2), any process that tries to
  create a file or directory with the same name will get an error
  code. The name of this hidden directory can be changed at mount
  time with ``-o snapdirname=.somethingelse`` (Linux) or the config
  option ``client_snapdir`` (libcephfs, ceph-fuse).
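
The st_blocks behavior above can be sketched in a few lines. This is an
illustrative model, not CephFS client code: ``cephfs_style_st_blocks``
is a hypothetical helper that applies the same size-divided-by-block-size
formula the text describes.

```python
def cephfs_style_st_blocks(st_size, block_size=512):
    # CephFS does not track which blocks of a file are allocated, so
    # st_blocks is derived from the apparent file size alone, rounded
    # up to whole blocks.
    return (st_size + block_size - 1) // block_size

# A 1 GiB sparse file that is almost entirely a hole:
apparent_size = 1 << 30

# CephFS-style accounting charges the full apparent size, so du(1)
# (which multiplies st_blocks by 512) overestimates consumed space.
print(cephfs_style_st_blocks(apparent_size) * 512)  # 1073741824

# A local file system that tracks allocation (ext4, XFS) would report
# only the blocks actually written for the same sparse file.
```
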

Perspective
-----------

People talk a lot about "POSIX compliance," but in reality most file
system implementations do not strictly adhere to the spec, including
local Linux file systems like ext4 and XFS. For example, for
performance reasons, the atomicity requirements for reads are relaxed:
processes reading from a file that is also being written may see torn
results.

Similarly, NFS has extremely weak consistency semantics when multiple
clients are interacting with the same files or directories, opting
instead for "close-to-open". In the world of network attached
storage, where most environments use NFS, whether or not the server's
file system is "fully POSIX" may not be relevant, and whether client
applications notice depends on whether data is being shared between
clients or not. NFS may also "tear" the results of concurrent writers,
as client data may not even be flushed to the server until the file is
closed (and more generally writes will be significantly more
time-shifted than in CephFS, leading to less predictable results).

Regardless, all of these file systems are similar enough to POSIX that
applications still work most of the time. Many other storage systems
(e.g., HDFS) claim to be "POSIX-like" but diverge significantly from
the standard by dropping support for things like in-place file
modifications, truncate, or directory renames.

Bottom line
-----------

CephFS relaxes more than local Linux kernel file systems (for example,
writes spanning object boundaries may be torn). It relaxes strictly
less than NFS when it comes to multiclient consistency, and generally
less than NFS when it comes to write atomicity.

In other words, when it comes to POSIX, ::

  HDFS < NFS < CephFS < {XFS, ext4}


fsync() and error reporting
---------------------------

POSIX is somewhat vague about the state of an inode after fsync reports
an error. In general, CephFS uses the standard error-reporting
mechanisms in the client's kernel, and therefore follows the same
conventions as other file systems.

In modern Linux kernels (v4.17 or later), writeback errors are reported
once to every file description that is open at the time of the error. In
addition, unreported errors that occurred before the file description was
opened will also be returned on fsync.

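A minimal sketch of consuming these errors in the usual way, using plain
POSIX calls through Python's ``os`` module (nothing here is
CephFS-specific, which is the point: errors surface through the normal
kernel mechanism). ``durable_write`` is a hypothetical helper for
illustration only:

```python
import os
import tempfile

def durable_write(path, data):
    # Write data and fsync it; fsync is where earlier writeback errors
    # (e.g. EIO, ENOSPC) are reported to this file description.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)
    except OSError as e:
        # Errors are reported once per open file description: a second
        # fsync on the same description may succeed even though the
        # data never reached stable storage.
        return e.errno
    finally:
        os.close(fd)
    return 0

with tempfile.TemporaryDirectory() as d:
    assert durable_write(os.path.join(d, "log"), b"hello") == 0
```
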
See `PostgreSQL's summary of fsync() error reporting across operating systems
<https://wiki.postgresql.org/wiki/Fsync_Errors>`_ and `Matthew Wilcox's
presentation on Linux IO error handling
<https://www.youtube.com/watch?v=74c19hwY2oE>`_ for more information.