ceph/doc/cephfs/posix.rst

========================
 Differences from POSIX
========================

CephFS aims to adhere to POSIX semantics wherever possible.  For
example, in contrast to many other common network file systems like
NFS, CephFS maintains strong cache coherency across clients.  The goal
is for processes communicating via the file system to behave the same
when they are on different hosts as when they are on the same host.

However, there are a few places where CephFS diverges from strict
POSIX semantics for various reasons:

- If a client is writing to a file and fails, its writes are not
  necessarily atomic. That is, the client may call write(2) on a file
  opened with O_SYNC with an 8 MB buffer and then crash, and the write
  may be only partially applied.  (Almost all file systems, even local
  file systems, have this behavior.)
- In shared simultaneous writer situations, a write that crosses
  object boundaries is not necessarily atomic. This means that you
  could have writer A write "aa|aa" and writer B write "bb|bb"
  simultaneously (where | is the object boundary), and end up with
  "aa|bb" rather than the proper "aa|aa" or "bb|bb".
- Sparse files are reported incorrectly in the stat(2) st_blocks field.
  Because CephFS does not explicitly track which parts of a file are
  allocated/written, the st_blocks field is always populated by the
  file size divided by the block size.  This will cause tools like
  du(1) to overestimate consumed space.  (The recursive size field,
  maintained by CephFS, also includes file "holes" in its count.)
- When a file is mapped into memory via mmap(2) on multiple hosts,
  writes are not coherently propagated to other clients' caches.  That
  is, if a page is cached on host A, and then updated on host B, host
  A's page is not coherently invalidated.  (Shared writable mmap
  appears to be quite rare: we have yet to hear any complaints about
  this behavior, and implementing cache coherency properly is complex.)
- CephFS clients present a hidden ``.snap`` directory that is used to
  access, create, delete, and rename snapshots.  Although the virtual
  directory is excluded from readdir(2), any process that tries to
  create a file or directory with the same name will get an error
  code.  The name of this hidden directory can be changed at mount
  time with ``-o snapdirname=.somethingelse`` (Linux) or the config
  option ``client_snapdir`` (libcephfs, ceph-fuse).
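
The st_blocks behavior described in the list above can be observed by
comparing a file's apparent size with its allocated size.  The
following is a minimal Python sketch, not part of CephFS itself: the
helper names ``apparent_and_allocated`` and ``make_sparse_file`` are
hypothetical, and the 512-byte unit for st_blocks is the Linux
convention, assumed here.

```python
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent_size, allocated_bytes) for path.

    Assumption: st_blocks is counted in 512-byte units, as on Linux.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path, size):
    """Create a file of `size` bytes without writing any data to it."""
    with open(path, "wb") as f:
        # Extending an empty file with truncate() leaves a hole on
        # file systems that support sparse files.
        f.truncate(size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "sparse.bin")
        make_sparse_file(p, 64 * 1024 * 1024)
        size, allocated = apparent_and_allocated(p)
        print(f"apparent={size} allocated={allocated}")
```

On a local file system such as ext4 or XFS, the allocated figure for a
fully sparse file is typically close to zero.  On CephFS, per the
description above, it would be expected to track the file size instead.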

Perspective
-----------

People talk a lot about "POSIX compliance," but in reality most file
system implementations do not strictly adhere to the spec, including
local Linux file systems like ext4 and XFS.  For example, for
performance reasons, the atomicity requirements for reads are relaxed:
processes reading from a file that is also being written may see torn
results.

Similarly, NFS has extremely weak consistency semantics when multiple
clients are interacting with the same files or directories, opting
instead for "close-to-open" consistency.  In the world of network
attached storage, where most environments use NFS, whether or not the
server's file system is "fully POSIX" may not be relevant, and whether
client applications notice depends on whether data is being shared
between clients or not.  NFS may also "tear" the results of concurrent
writers, as client data may not even be flushed to the server until the
file is closed (and more generally writes will be significantly more
time-shifted than in CephFS, leading to less predictable results).

However, all of these are very close to POSIX, and most of the time
applications don't notice too much.  Many other storage systems (e.g.,
HDFS) claim to be "POSIX-like" but diverge significantly from the
standard by dropping support for things like in-place file
modifications, truncate, or directory renames.


Bottom line
-----------

CephFS relaxes more than local Linux kernel file systems (e.g., writes
spanning object boundaries may be torn).  It relaxes strictly less
than NFS when it comes to multiclient consistency, and generally less
than NFS when it comes to write atomicity.

In other words, when it comes to POSIX, ::

  HDFS < NFS < CephFS < {XFS, ext4}
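
The torn-write caveat applies only to writes that span an object
boundary, so it can help to check whether a given write does.  The
sketch below is illustrative only: the function names are hypothetical,
and it assumes a simple layout in which the file is cut into
consecutive fixed-size objects (4 MiB is used as an assumed default)
with no striping across objects.  Check the actual file layout before
relying on the numbers.

```python
ASSUMED_OBJECT_SIZE = 4 * 1024 * 1024  # assumption: 4 MiB objects, no striping


def objects_touched(offset, length, object_size=ASSUMED_OBJECT_SIZE):
    """Return the range of object indices covered by a write.

    The write covers bytes [offset, offset + length).
    """
    if offset < 0 or length <= 0 or object_size <= 0:
        raise ValueError("offset must be >= 0; length and object_size > 0")
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return range(first, last + 1)


def crosses_object_boundary(offset, length, object_size=ASSUMED_OBJECT_SIZE):
    """Return True if the write touches more than one object."""
    return len(objects_touched(offset, length, object_size)) > 1


if __name__ == "__main__":
    # The "aa|aa" case from the text: a 4-byte write that starts
    # 2 bytes before the first object boundary.
    print(crosses_object_boundary(ASSUMED_OBJECT_SIZE - 2, 4))
```

Under these assumptions, a writer that keeps each write inside a single
object avoids the "aa|bb" outcome described earlier; writes that return
``True`` here are the ones that concurrent writers may tear.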


fsync() and error reporting
---------------------------

POSIX is somewhat vague about the state of an inode after fsync reports
an error. In general, CephFS uses the standard error-reporting
mechanisms in the client's kernel, and therefore follows the same
conventions as other file systems.

In modern Linux kernels (v4.17 or later), writeback errors are reported
once to every file description that is open at the time of the error. In
addition, unreported errors that occurred before the file description was
opened will also be returned on fsync.
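
Because a writeback error is reported only once per open file
description, an application should treat a failed fsync as a failure of
everything written since the last successful fsync, rather than simply
calling fsync again.  The following Python sketch shows one way to
structure this; ``write_durably`` is a hypothetical helper, not a CephFS
or libcephfs API, and the recovery policy (propagate the error to the
caller) is an assumption.

```python
import os


def write_durably(path, data):
    """Write `data` to `path` and fsync it.

    Raises OSError if the write or the fsync fails.  The caller should
    then assume the data may not have reached stable storage and redo
    the whole write, since a second fsync on the same descriptor may
    succeed without the lost data having been written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # raises OSError if a writeback error is reported
    finally:
        os.close(fd)
```

A caller would wrap ``write_durably(path, payload)`` in ``try`` /
``except OSError`` and rewrite the file from its own copy of the data on
failure.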

See `PostgreSQL's summary of fsync() error reporting across operating systems
<https://wiki.postgresql.org/wiki/Fsync_Errors>`_ and `Matthew Wilcox's
presentation on Linux IO error handling
<https://www.youtube.com/watch?v=74c19hwY2oE>`_ for more information.