========================
Differences from POSIX
========================

CephFS aims to adhere to POSIX semantics wherever possible. For
example, in contrast to many other common network file systems like
NFS, CephFS maintains strong cache coherency across clients. The goal
is for processes communicating via the file system to behave the same
when they are on different hosts as when they are on the same host.

However, there are a few places where CephFS diverges from strict
POSIX semantics for various reasons:

- If a client is writing to a file and fails, its writes are not
  necessarily atomic. That is, the client may call write(2) on a file
  opened with O_SYNC with an 8 MB buffer and then crash, and the write
  may be only partially applied. (Almost all file systems, even local
  file systems, have this behavior.)
- In shared simultaneous writer situations, a write that crosses
  object boundaries is not necessarily atomic. This means that you
  could have writer A write "aa|aa" and writer B write "bb|bb"
  simultaneously (where | is the object boundary), and end up with
  "aa|bb" rather than the proper "aa|aa" or "bb|bb".
- Sparse files propagate incorrectly to the stat(2) st_blocks field.
  Because CephFS does not explicitly track which parts of a file are
  allocated/written, the st_blocks field is always populated by the
  file size divided by the block size. This will cause tools like
  du(1) to overestimate consumed space. (The recursive size field,
  maintained by CephFS, also includes file "holes" in its count.)
- When a file is mapped into memory via mmap(2) on multiple hosts,
  writes are not coherently propagated to other clients' caches. That
  is, if a page is cached on host A, and then updated on host B, host
  A's page is not coherently invalidated. (Shared writable mmap
  appears to be quite rare--we have yet to hear any complaints about
  this behavior, and implementing cache coherency properly is
  complex.)
- CephFS clients present a hidden ``.snap`` directory that is used to
  access, create, delete, and rename snapshots. Although the virtual
  directory is excluded from readdir(2), any process that tries to
  create a file or directory with the same name will get an error
  code. The name of this hidden directory can be changed at mount
  time with ``-o snapdirname=.somethingelse`` (Linux) or the config
  option ``client_snapdir`` (libcephfs, ceph-fuse).
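
The st_blocks behavior above can be sketched in a few lines. This is an
illustrative model, not CephFS client code: ``cephfs_style_st_blocks``
is a hypothetical helper that applies the same size-divided-by-block-size
formula the text describes.

```python
def cephfs_style_st_blocks(st_size, block_size=512):
    # CephFS does not track which blocks of a file are allocated, so
    # st_blocks is derived from the apparent file size alone, rounded
    # up to whole blocks.
    return (st_size + block_size - 1) // block_size

# A 1 GiB sparse file that is almost entirely a hole:
apparent_size = 1 << 30

# CephFS-style accounting charges the full apparent size, so du(1)
# (which multiplies st_blocks by 512) overestimates consumed space.
print(cephfs_style_st_blocks(apparent_size) * 512)  # 1073741824

# A local file system that tracks allocation (ext4, XFS) would report
# only the blocks actually written for the same sparse file.
```
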

Perspective
-----------

People talk a lot about "POSIX compliance," but in reality most file
system implementations do not strictly adhere to the spec, including
local Linux file systems like ext4 and XFS. For example, for
performance reasons, the atomicity requirements for reads are relaxed:
processes reading from a file that is also being written may see torn
results.

Similarly, NFS has extremely weak consistency semantics when multiple
clients are interacting with the same files or directories, opting
instead for "close-to-open". In the world of network attached
storage, where most environments use NFS, whether or not the server's
file system is "fully POSIX" may not be relevant, and whether client
applications notice depends on whether data is being shared between
clients or not. NFS may also "tear" the results of concurrent writers,
as client data may not even be flushed to the server until the file is
closed (and more generally writes will be significantly more
time-shifted than in CephFS, leading to less predictable results).

Regardless, all of these file systems are similar enough to POSIX that
applications still work most of the time. Many other storage systems
(e.g., HDFS) claim to be "POSIX-like" but diverge significantly from
the standard by dropping support for things like in-place file
modifications, truncate, or directory renames.

Bottom line
-----------

CephFS relaxes more than local Linux kernel file systems (for example,
writes spanning object boundaries may be torn). It relaxes strictly
less than NFS when it comes to multiclient consistency, and generally
less than NFS when it comes to write atomicity.

In other words, when it comes to POSIX, ::

  HDFS < NFS < CephFS < {XFS, ext4}


fsync() and error reporting
---------------------------

POSIX is somewhat vague about the state of an inode after fsync reports
an error. In general, CephFS uses the standard error-reporting
mechanisms in the client's kernel, and therefore follows the same
conventions as other file systems.

In modern Linux kernels (v4.17 or later), writeback errors are reported
once to every file description that is open at the time of the error. In
addition, unreported errors that occurred before the file description was
opened will also be returned on fsync.

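A minimal sketch of consuming these errors in the usual way, using plain
POSIX calls through Python's ``os`` module (nothing here is
CephFS-specific, which is the point: errors surface through the normal
kernel mechanism). ``durable_write`` is a hypothetical helper for
illustration only:

```python
import os
import tempfile

def durable_write(path, data):
    # Write data and fsync it; fsync is where earlier writeback errors
    # (e.g. EIO, ENOSPC) are reported to this file description.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)
    except OSError as e:
        # Errors are reported once per open file description: a second
        # fsync on the same description may succeed even though the
        # data never reached stable storage.
        return e.errno
    finally:
        os.close(fd)
    return 0

with tempfile.TemporaryDirectory() as d:
    assert durable_write(os.path.join(d, "log"), b"hello") == 0
```
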
See `PostgreSQL's summary of fsync() error reporting across operating systems
<https://wiki.postgresql.org/wiki/Fsync_Errors>`_ and `Matthew Wilcox's
presentation on Linux IO error handling
<https://www.youtube.com/watch?v=74c19hwY2oE>`_ for more information.