========================
 Differences from POSIX
========================

CephFS aims to adhere to POSIX semantics wherever possible. For
example, in contrast to many other common network file systems like
NFS, CephFS maintains strong cache coherency across clients. The goal
is for processes communicating via the file system to behave the same
when they are on different hosts as when they are on the same host.

However, there are a few places where CephFS diverges from strict
POSIX semantics for various reasons:

- If a client is writing to a file and fails, its writes are not
  necessarily atomic. That is, the client may call write(2) on a file
  opened with O_SYNC with an 8 MB buffer and then crash, and the write
  may be only partially applied. (Almost all file systems, even local
  file systems, have this behavior.)
- In shared simultaneous writer situations, a write that crosses
  object boundaries is not necessarily atomic. This means that you
  could have writer A write "aa|aa" and writer B write "bb|bb"
  simultaneously (where | is the object boundary), and end up with
  "aa|bb" rather than the proper "aa|aa" or "bb|bb".
- Sparse files propagate incorrectly to the stat(2) st_blocks field.
  Because CephFS does not explicitly track which parts of a file are
  allocated/written, the st_blocks field is always populated by the
  file size divided by the block size. This will cause tools like
  du(1) to overestimate consumed space. (The recursive size field,
  maintained by CephFS, also includes file "holes" in its count.)
- When a file is mapped into memory via mmap(2) on multiple hosts,
  writes are not coherently propagated to other clients' caches. That
  is, if a page is cached on host A, and then updated on host B, host
  A's page is not coherently invalidated. (Shared writable mmap
  appears to be quite rare--we have yet to hear any complaints about this
  behavior, and implementing cache coherency properly is complex.)
- CephFS clients present a hidden ``.snap`` directory that is used to
  access, create, delete, and rename snapshots. Although the virtual
  directory is excluded from readdir(2), any process that tries to
  create a file or directory with the same name will get an error
  code. The name of this hidden directory can be changed at mount
  time with ``-o snapdirname=.somethingelse`` (Linux) or the config
  option ``client_snapdir`` (libcephfs, ceph-fuse).
- CephFS does not currently maintain the ``atime`` field. Most applications
  do not care, though this impacts some backup and data tiering
  applications that can move unused data to a secondary storage system.
  You may be able to work around this for some use cases, as CephFS does
  support setting ``atime`` via the ``setattr`` operation.
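The ``st_blocks`` divergence is easy to observe with a small sketch
(plain Python; the temporary file location is incidental). On a local
file system such as ext4 or XFS, a sparse file reports only its
allocated blocks, while on CephFS ``st_blocks`` is derived from the
file size, so du(1) counts the holes too:

```python
import os
import tempfile

# Create an 8 MiB sparse file: seek past the end and write one byte,
# leaving everything before that byte as a hole.
fd, path = tempfile.mkstemp()
try:
    os.lseek(fd, 8 * 1024 * 1024 - 1, os.SEEK_SET)
    os.write(fd, b"\0")
finally:
    os.close(fd)

st = os.stat(path)
apparent = st.st_size          # 8 MiB logical size
consumed = st.st_blocks * 512  # st_blocks is counted in 512-byte units
print(f"apparent={apparent} consumed={consumed}")
# On ext4/XFS, consumed is far smaller than apparent (holes cost nothing);
# on CephFS, st_blocks is computed from st_size, so du(1) sees ~8 MiB.
os.unlink(path)
```

Running this on a local file system shows the gap between logical and
consumed size that CephFS does not report.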

Perspective
-----------

People talk a lot about "POSIX compliance," but in reality most file
system implementations do not strictly adhere to the spec, including
local Linux file systems like ext4 and XFS. For example, for
performance reasons, the atomicity requirements for reads are relaxed:
processes reading from a file that is also being written may see torn
results.

Similarly, NFS has extremely weak consistency semantics when multiple
clients are interacting with the same files or directories, opting
instead for "close-to-open". In the world of network attached
storage, where most environments use NFS, whether or not the server's
file system is "fully POSIX" may not be relevant, and whether client
applications notice depends on whether data is being shared between
clients or not. NFS may also "tear" the results of concurrent writers
as client data may not even be flushed to the server until the file is
closed (and more generally writes will be significantly more
time-shifted than CephFS, leading to less predictable results).

Regardless, these are all similar enough to POSIX, and applications still work
most of the time. Many other storage systems (e.g., HDFS) claim to be
"POSIX-like" but diverge significantly from the standard by dropping support
for things like in-place file modifications, truncate, or directory renames.

Bottom line
-----------

CephFS relaxes more than local Linux kernel file systems (for example, writes
spanning object boundaries may be torn). It relaxes strictly less
than NFS when it comes to multi-client consistency, and generally less
than NFS when it comes to write atomicity.

In other words, when it comes to POSIX, ::

  HDFS < NFS < CephFS < {XFS, ext4}


fsync() and error reporting
---------------------------

POSIX is somewhat vague about the state of an inode after fsync reports
an error. In general, CephFS uses the standard error-reporting
mechanisms in the client's kernel, and therefore follows the same
conventions as other file systems.

In modern Linux kernels (v4.17 or later), writeback errors are reported
once to every file description that is open at the time of the error. In
addition, unreported errors that occurred before the file description was
opened will also be returned on fsync.
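A sketch of what this means for applications (plain Python; the path
and retry policy are illustrative assumptions, not CephFS-specific
API): because a writeback error is delivered only once per open file
description, a failed ``fsync`` should be treated as "the data may be
lost", and recovery should re-open and rewrite the file rather than
simply calling ``fsync`` again on the same descriptor.

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data and fsync; raise if the kernel reports a writeback error.

    On Linux >= 4.17 a writeback error is reported once to each open
    file description, so after a failed fsync() the only safe recovery
    is to re-open the file and write the data again from scratch.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # may raise OSError (e.g. EIO) on writeback failure
    finally:
        os.close(fd)

# Usage (illustrative path): an exception here means the bytes may
# never have reached stable storage and the whole write must be redone.
durable_write("/tmp/cephfs-fsync-demo", b"hello")
```

Note that calling ``fsync`` a second time on the same descriptor after
a failure would succeed without the data having been persisted, which
is why the retry must start from a fresh write.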

See `PostgreSQL's summary of fsync() error reporting across operating systems
<https://wiki.postgresql.org/wiki/Fsync_Errors>`_ and `Matthew Wilcox's
presentation on Linux IO error handling
<https://www.youtube.com/watch?v=74c19hwY2oE>`_ for more information.