Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | ======================== |
2 | Differences from POSIX | |
3 | ======================== | |
4 | ||
5 | CephFS aims to adhere to POSIX semantics wherever possible. For | |
6 | example, in contrast to many other common network file systems like | |
7 | NFS, CephFS maintains strong cache coherency across clients. The goal | |
8 | is for processes communicating via the file system to behave the same | |
9 | when they are on different hosts as when they are on the same host. | |
10 | ||
11 | However, there are a few places where CephFS diverges from strict | |
12 | POSIX semantics for various reasons: | |
13 | ||
14 | - If a client is writing to a file and fails, its writes are not | |
15 | necessarily atomic. That is, the client may call write(2) on a file | |
16 | opened with O_SYNC with an 8 MB buffer and then crash, and the write | |
17 | may be only partially applied. (Almost all file systems, even local | |
18 | file systems, have this behavior.) | |
19 | - In shared simultaneous writer situations, a write that crosses | |
20 | object boundaries is not necessarily atomic. This means that you | |
21 | could have writer A write "aa|aa" and writer B write "bb|bb" | |
22 | simultaneously (where | is the object boundary), and end up with | |
23 | "aa|bb" rather than the proper "aa|aa" or "bb|bb". | |
7c673cae FG |
24 | - Sparse files propagate incorrectly to the stat(2) st_blocks field. |
25 | Because CephFS does not explicitly track which parts of a file are | |
26 | allocated/written, the st_blocks field is always populated by the | |
27 | file size divided by the block size. This will cause tools like | |
28 | du(1) to overestimate consumed space. (The recursive size field, | |
29 | maintained by CephFS, also includes file "holes" in its count.) | |
30 | - When a file is mapped into memory via mmap(2) on multiple hosts, | |
31 | writes are not coherently propagated to other clients' caches. That | |
32 | is, if a page is cached on host A, and then updated on host B, host | |
33 | A's page is not coherently invalidated. (Shared writable mmap | |
39ae355f | 34 | appears to be quite rare--we have yet to hear any complaints about this |
7c673cae FG |
35 | behavior, and implementing cache coherency properly is complex.) |
36 | - CephFS clients present a hidden ``.snap`` directory that is used to | |
37 | access, create, delete, and rename snapshots. Although the virtual | |
38 | directory is excluded from readdir(2), any process that tries to | |
39 | create a file or directory with the same name will get an error | |
40 | code. The name of this hidden directory can be changed at mount | |
41 | time with ``-o snapdirname=.somethingelse`` (Linux) or the config | |
42 | option ``client_snapdir`` (libcephfs, ceph-fuse). | |
1e59de90 TL |
43 | - CephFS does not currently maintain the ``atime`` field. Most applications |
44 | do not care, though this impacts some backup and data tiering | |
45 | applications that can move unused data to a secondary storage system. | |
46 | You may be able to work around this for some use cases, as CephFS does | |
47 | support setting ``atime`` via the ``setattr`` operation. | |
11fdf7f2 TL |
48 | |
49 | Perspective | |
50 | ----------- | |
51 | ||
52 | People talk a lot about "POSIX compliance," but in reality most file | |
53 | system implementations do not strictly adhere to the spec, including | |
54 | local Linux file systems like ext4 and XFS. For example, for | |
55 | performance reasons, the atomicity requirements for reads are relaxed: | |
56 | processes reading from a file that is also being written may see torn | |
57 | results. | |
58 | ||
59 | Similarly, NFS has extremely weak consistency semantics when multiple | |
60 | clients are interacting with the same files or directories, opting | |
61 | instead for "close-to-open". In the world of network attached | |
62 | storage, where most environments use NFS, whether or not the server's | |
63 | file system is "fully POSIX" may not be relevant, and whether client | |
64 | applications notice depends on whether data is being shared between | |
65 | clients or not. NFS may also "tear" the results of concurrent writers | |
66 | as client data may not even be flushed to the server until the file is | |
67 | closed (and more generally writes will be significantly more | |
68 | time-shifted than CephFS, leading to less predictable results). | |
69 | ||
39ae355f TL |
70 | Regardless, these are all similar enough to POSIX, and applications still work |
71 | most of the time. Many other storage systems (e.g., HDFS) claim to be | |
72 | "POSIX-like" but diverge significantly from the standard by dropping support | |
73 | for things like in-place file modifications, truncate, or directory renames. | |
11fdf7f2 TL |
74 | |
75 | Bottom line | |
76 | ----------- | |
77 | ||
39ae355f | 78 | CephFS relaxes more than local Linux kernel file systems (for example, writes |
11fdf7f2 TL |
79 | spanning object boundaries may be torn). It relaxes strictly less |
80 | than NFS when it comes to multiclient consistency, and generally less | |
81 | than NFS when it comes to write atomicity. | |
82 | ||
83 | In other words, when it comes to POSIX, :: | |
84 | ||
85 | HDFS < NFS < CephFS < {XFS, ext4} | |
eafe8130 TL |
86 | |
87 | ||
88 | fsync() and error reporting | |
89 | --------------------------- | |
90 | ||
91 | POSIX is somewhat vague about the state of an inode after fsync reports | |
92 | an error. In general, CephFS uses the standard error-reporting | |
93 | mechanisms in the client's kernel, and therefore follows the same | |
9f95a23c | 94 | conventions as other file systems. |
eafe8130 TL |
95 | |
96 | In modern Linux kernels (v4.17 or later), writeback errors are reported | |
97 | once to every file description that is open at the time of the error. In | |
9f95a23c | 98 | addition, unreported errors that occurred before the file description was |
eafe8130 TL |
99 | opened will also be returned on fsync. |
100 | ||
101 | See `PostgreSQL's summary of fsync() error reporting across operating systems | |
102 | <https://wiki.postgresql.org/wiki/Fsync_Errors>`_ and `Matthew Wilcox's | |
103 | presentation on Linux IO error handling | |
104 | <https://www.youtube.com/watch?v=74c19hwY2oE>`_ for more information. |
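
The fsync() error-reporting behavior described above has a practical consequence for applications: because a writeback error is reported once per open file description, a second fsync on the same descriptor may succeed even though the earlier data never reached stable storage. The following is a minimal sketch of one way an application might handle this, not a CephFS-specific API. The helper name ``durable_write`` and its return convention are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> bool:
    """Write data to path and report whether fsync confirmed it.

    Hypothetical helper for illustration. Returns True only if every
    write and the final fsync succeeded. On failure the caller should
    assume the data may not be durable and rewrite it from its own
    copy, rather than calling fsync again on the same descriptor.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop
        # until the whole buffer has been handed to the kernel.
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # fsync surfaces writeback errors for this file description.
        os.fsync(fd)
        return True
    except OSError as exc:
        print(f"write or fsync failed: {exc}")
        return False
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.dat")
        print(durable_write(target, b"hello"))
```

The design choice here is to treat any fsync failure as final for that descriptor and leave recovery (for example, rewriting from an in-memory copy or a log) to the caller.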