Commit | Line | Data |
---|---|---|
5b88fdd9 SF |
1 | |
2 | # Copyright (C) 2005-2016 Junjiro R. Okajima | |
3 | # | |
4 | # This program is free software; you can redistribute it and/or modify | |
5 | # it under the terms of the GNU General Public License as published by | |
6 | # the Free Software Foundation; either version 2 of the License, or | |
7 | # (at your option) any later version. | |
8 | # | |
9 | # This program is distributed in the hope that it will be useful, | |
10 | # but WITHOUT ANY WARRANTY; without even the implied warranty of | |
11 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
12 | # GNU General Public License for more details. | |
13 | # | |
14 | # You should have received a copy of the GNU General Public License | |
15 | # along with this program. If not, see <http://www.gnu.org/licenses/>. | |
16 | ||
17 | Introduction | |
18 | ---------------------------------------- | |
19 | ||
20 | aufs [ei ju: ef es] | [a u f s] | |
21 | 1. abbrev. for "advanced multi-layered unification filesystem". | |
22 | 2. abbrev. for "another unionfs". | |
23 | 3. abbrev. for "auf das" in German which means "on the" in English. | |
24 | Ex. "Butter aufs Brot"(G) means "butter onto bread"(E). | |
25 | But "Filesystem aufs Filesystem" is hard to understand. | |
26 | ||
27 | AUFS is a filesystem with the following features: | |
28 | - multi-layered stackable unification filesystem; each member directory | |
29 | is called a branch. | |
30 | - branch permissions and attributes, 'readonly', 'real-readonly', | |
31 | 'readwrite', 'whiteout-able', 'link-able whiteout', etc. and their | |
32 | combinations. | |
33 | - internal "file copy-on-write". | |
34 | - logical deletion, whiteout. | |
35 | - dynamic branch manipulation, adding, deleting and changing permission. | |
36 | - allow bypassing aufs, user's direct branch access. | |
37 | - external inode number translation table and bitmap which maintain the | |
38 | persistent aufs inode numbers. | |
39 | - seekable directory, including NFS readdir. | |
40 | - file mapping, mmap and sharing pages. | |
41 | - pseudo-link, hardlink over branches. | |
42 | - loopback mounted filesystem as a branch. | |
43 | - several policies to select one among multiple writable branches. | |
44 | - revert a single systemcall when an error occurs in aufs. | |
45 | - and more... | |
46 | ||
47 | ||
48 | Multi Layered Stackable Unification Filesystem | |
49 | ---------------------------------------------------------------------- | |
50 | Most people already know what it is. | |
51 | It is a filesystem which unifies several directories and provides a | |
52 | single merged directory. When users access a file, the access will be | |
53 | passed/re-directed/converted (sorry, I am not sure which English word is | |
54 | correct) to the real file on the member filesystem. The member | |
55 | filesystem is called a 'lower filesystem' or 'branch' and has a mode, | |
56 | either 'readonly' or 'readwrite'. Deleting a file on a lower | |
57 | readonly branch is handled by creating a 'whiteout' on the upper writable | |
58 | branch. | |
59 | ||
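The merged view and the whiteout rule described above can be modelled in a few lines of userspace Python. This is only a sketch and not aufs code: branches are plain dictionaries ordered from the top branch down, and the whiteout marker name `.wh.<name>` as well as the helper names `lookup` and `unlink` are assumptions made for the illustration.

```python
WH_PREFIX = ".wh."  # assumed whiteout marker name for this sketch


def lookup(branches, name):
    """Return (branch_index, content) as seen in the merged view, or None."""
    for index, branch in enumerate(branches):  # index 0 is the top branch
        files = branch["files"]
        if WH_PREFIX + name in files:
            return None  # a whiteout hides the name in all lower branches
        if name in files:
            return index, files[name]
    return None


def unlink(branches, name):
    """Delete a name from the merged view without touching readonly branches."""
    hit = lookup(branches, name)
    if hit is None:
        raise FileNotFoundError(name)
    index = hit[0]
    # Nearest writable branch at or above the branch that holds the file.
    writable = next(
        (i for i in range(index, -1, -1) if branches[i]["mode"] == "rw"), None
    )
    if writable is None:
        raise PermissionError("no writable branch above " + name)
    if writable == index:
        del branches[index]["files"][name]  # real deletion on a writable branch
    if lookup(branches, name) is not None:
        # A lower copy is still visible: hide it logically with a whiteout.
        branches[writable]["files"][WH_PREFIX + name] = ""


if __name__ == "__main__":
    demo = [
        {"mode": "rw", "files": {}},
        {"mode": "ro", "files": {"a.txt": "lower"}},
    ]
    unlink(demo, "a.txt")
    print(lookup(demo, "a.txt"), demo)
```

In this model the readonly branch is never modified; the file disappears from the merged view only because of the marker on the writable branch.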
60 | On LKML, there have been discussions about UnionMount (Jan Blunck, | |
61 | Bharata B Rao and Valerie Aurora) and Unionfs (Erez Zadok). They took | |
62 | different approaches to implementing the merged view. | |
63 | The former tries putting it into VFS, and the latter implements it as a | |
64 | separate filesystem. | |
65 | (If I have misunderstood these implementations, please let me know and | |
66 | I shall correct it. It has been a long time since I last read their | |
67 | source files.) | |
68 | ||
69 | UnionMount's approach can be kept small, but it may be hard to share | |
70 | branches between several UnionMounts since its whiteout is | |
71 | implemented in the inode on the branch filesystem and is always | |
72 | shared. According to Bharata's post, readdir does not seem to be | |
73 | finished yet. | |
74 | There are several known missing features in this implementation, such as | |
75 | - for users, the inode number may change silently, e.g. by copy-up. | |
76 | - link(2) may be broken by copy-up. | |
77 | - read(2) may return obsolete file data (fstat(2) too). | |
78 | - fcntl(F_SETLK) may be broken by copy-up. | |
79 | - unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after | |
80 | open(O_RDWR). | |
81 | ||
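The first two items in the list above can be reproduced in userspace with a plain file copy. The sketch below is not UnionMount or aufs code; `naive_copy_up` is a hypothetical helper that copies a file from a "lower" directory to an "upper" one, the way a copy-up without any inode bookkeeping would.

```python
import os
import shutil
import tempfile


def naive_copy_up(lower, upper, name):
    """Copy lower/name to upper/name with no inode bookkeeping (hypothetical)."""
    dst = os.path.join(upper, name)
    shutil.copy2(os.path.join(lower, name), dst)
    return dst


def demo():
    with tempfile.TemporaryDirectory() as top:
        lower = os.path.join(top, "lower")
        upper = os.path.join(top, "upper")
        os.mkdir(lower)
        os.mkdir(upper)
        a = os.path.join(lower, "a")
        b = os.path.join(lower, "b")
        with open(a, "w") as f:
            f.write("data")
        os.link(a, b)  # "a" and "b" are hardlinks to one inode
        linked_before = os.stat(a).st_ino == os.stat(b).st_ino
        ino_before = os.stat(a).st_ino

        copied = naive_copy_up(lower, upper, "a")
        ino_changed = os.stat(copied).st_ino != ino_before
        with open(copied, "a") as f:
            f.write("+more")  # the write lands on the copy only
        with open(b) as f:
            b_data = f.read()  # the former hardlink does not see it
        return linked_before, ino_changed, b_data


if __name__ == "__main__":
    print(demo())
```

After the copy, the user-visible file has a new inode number, and a write through it no longer reaches the name that used to be its hardlink.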
82 | In linux-3.18, the "overlay" filesystem (formerly known as "overlayfs") was | |
83 | merged into mainline. This is another implementation of UnionMount as a | |
84 | separate filesystem. All the limitations and known problems of | |
85 | UnionMount are equally inherited by the "overlay" filesystem. | |
86 | ||
87 | Unionfs has a longer history. When I started implementing a stackable | |
88 | filesystem (Aug 2005), it already existed. It has virtual super_block, | |
89 | inode, dentry and file objects, and each has an array pointing to the | |
90 | lower objects of the same kind. After contributing many patches for Unionfs, I | |
91 | re-started my project AUFS (Jun 2006). | |
92 | ||
93 | In AUFS, the structure of the filesystem resembles Unionfs, but I | |
94 | implemented my own ideas, approaches and enhancements and it became | |
95 | a totally different one. | |
96 | ||
97 | Comparing DM snapshot with the fs-based implementation, for DM snapshot: | |
98 | - the number of bytes to be copied between devices is much smaller. | |
99 | - only a single type of filesystem can be used. | |
100 | - the fs must be writable, no readonly fs, even for the lower original | |
101 | device, so a compressed fs will not be usable. but if we use a | |
102 | loopback mount, we may address this issue. | |
103 | for instance, | |
104 | mount /cdrom/squashfs.img /sq | |
105 | losetup /sq/ext2.img | |
106 | losetup /somewhere/cow | |
107 | dmsetup "snapshot /dev/loop0 /dev/loop1 ..." | |
108 | - it will be difficult (or needs more operations) to extract the | |
109 | difference between the original device and COW. | |
110 | - DM snapshot-merge may help a lot when users try merging. in the | |
111 | fs-layer union, users will use rsync(1). | |
112 | ||
113 | You may want to read my old paper "Filesystems in LiveCD" | |
114 | (http://aufs.sourceforge.net/aufs2/report/sq/sq.pdf). | |
115 | ||
116 | ||
117 | Several characters/aspects/persona of aufs | |
118 | ---------------------------------------------------------------------- | |
119 | ||
120 | Aufs has several characters, aspects or persona. | |
121 | 1. a filesystem, callee of VFS helper | |
122 | 2. sub-VFS, caller of VFS helper for branches | |
123 | 3. a virtual filesystem which maintains persistent inode number | |
124 | 4. reader/writer of files on branches, just like an application | |
125 | ||
126 | 1. Callee of VFS Helper | |
127 | As an ordinary Linux filesystem, aufs is a callee of VFS. For instance, | |
128 | unlink(2) from an application reaches the sys_unlink() kernel function and | |
129 | then vfs_unlink() is called. vfs_unlink() is one of the VFS helpers and it | |
130 | calls the filesystem-specific unlink operation. Aufs does implement the | |
131 | unlink operation, but it behaves like a redirector. | |
132 | ||
133 | 2. Caller of VFS Helper for Branches | |
134 | aufs_unlink() passes the unlink request to the branch filesystem as if | |
135 | it were called from VFS. So the called unlink operation of the branch | |
136 | filesystem acts as usual. As a caller of the VFS helper, aufs should handle | |
137 | every necessary pre/post operation for the branch filesystem: | |
138 | - acquire the lock for the parent dir on a branch | |
139 | - lookup in a branch | |
140 | - revalidate dentry on a branch | |
141 | - mnt_want_write() for a branch | |
142 | - vfs_unlink() for a branch | |
143 | - mnt_drop_write() for a branch | |
144 | - release the lock on a branch | |
145 | ||
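The list above has a bracket structure: whatever is acquired before vfs_unlink() has to be released afterwards in reverse order, even when the unlink fails. A small userspace Python model of that ordering follows. It only records step names in a log; `bracket` and `branch_unlink` are hypothetical helpers and nothing here calls into a kernel.

```python
from contextlib import contextmanager


@contextmanager
def bracket(log, enter, leave):
    """Record a pre-operation now and its matching post-operation on exit."""
    log.append(enter)
    try:
        yield
    finally:
        log.append(leave)  # runs even if the body raises


def branch_unlink(log, fail=False):
    """Model of the pre/post operations around vfs_unlink() on a branch."""
    with bracket(log, "lock parent dir", "unlock parent dir"):
        log.append("lookup")
        log.append("revalidate dentry")
        with bracket(log, "mnt_want_write", "mnt_drop_write"):
            if fail:
                raise OSError("simulated failure in vfs_unlink")
            log.append("vfs_unlink")


if __name__ == "__main__":
    steps = []
    branch_unlink(steps)
    print("\n".join(steps))
```

With `fail=True` the log still ends with `mnt_drop_write` followed by `unlock parent dir`, which is the cleanup a caller of the VFS helper is responsible for.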
146 | 3. Persistent Inode Number | |
147 | One of the most important issues for a filesystem is maintaining inode | |
148 | numbers. This is particularly important to support exporting a | |
149 | filesystem via NFS. Aufs is a virtual filesystem which doesn't have a | |
150 | backend block device of its own, but some storage is necessary to | |
151 | keep and maintain the inode numbers. It may need a large space and may not | |
152 | be suitable to keep in memory. Aufs rents some space from its first writable | |
153 | branch filesystem (by default) and creates file(s) on it. These files | |
154 | are created by aufs internally and removed soon afterwards (currently), | |
155 | while being kept open. | |
156 | Note: Because these files are removed, they are totally gone after | |
157 | unmounting aufs. It means the inode numbers are not persistent | |
158 | across unmount or reboot. I have a plan to make them really | |
159 | persistent, which will be important for aufs on an NFS server. | |
160 | ||
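To make the idea of a translation table plus bitmap concrete, here is a minimal in-memory sketch in Python. It is not the aufs on-disk format: the class name `InodeTable`, the key layout (branch id, inode number on the branch) and the starting number 2 are all assumptions chosen for the illustration.

```python
class InodeTable:
    """Maps (branch id, inode number on the branch) to a stable virtual number."""

    def __init__(self, first_ino=2):  # starting number is an assumption
        self.first_ino = first_ino
        self.table = {}            # (branch_id, lower_ino) -> virtual ino
        self.bitmap = bytearray()  # one bit per virtual ino, set = in use

    def _alloc(self):
        # Find the first clear bit; grow the bitmap when every bit is set.
        for bit in range(len(self.bitmap) * 8):
            if not self.bitmap[bit // 8] & (1 << (bit % 8)):
                break
        else:
            bit = len(self.bitmap) * 8
            self.bitmap.append(0)
        self.bitmap[bit // 8] |= 1 << (bit % 8)
        return self.first_ino + bit

    def get(self, branch_id, lower_ino):
        """Return the virtual number, allocating one on first use."""
        key = (branch_id, lower_ino)
        if key not in self.table:
            self.table[key] = self._alloc()
        return self.table[key]

    def forget(self, branch_id, lower_ino):
        """Drop the mapping and mark its number as free for reuse."""
        bit = self.table.pop((branch_id, lower_ino)) - self.first_ino
        self.bitmap[bit // 8] &= ~(1 << (bit % 8)) & 0xFF


if __name__ == "__main__":
    t = InodeTable()
    print(t.get(0, 100), t.get(1, 100), t.get(0, 100))
```

The same inode number on two different branches gets two different virtual numbers, repeated lookups return the same number, and freed numbers are found again through the bitmap.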
161 | 4. Read/Write Files Internally (copy-on-write) | |
162 | Because a branch can be readonly, when you write to a file on it, aufs will | |
163 | "copy-up" it to the upper writable branch internally, and then write the | |
164 | originally requested data to the file. Generally the kernel doesn't | |
165 | open/read/write files actively. In aufs, even a single write may cause an | |
166 | internal "file copy". This behaviour is very similar to the cp(1) command. | |
167 | ||
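The copy-up step can be imitated in userspace with ordinary directories standing in for branches. The Python sketch below is an illustration only, not the in-kernel implementation: `find` and `append_data` are hypothetical names, branches are (directory, mode) pairs ordered from the top branch down, and shutil.copy2() plays the role of the cp(1)-like internal copy.

```python
import os
import shutil
import tempfile


def find(branches, name):
    """Return the index of the topmost branch that contains name, or None."""
    for index, (root, _mode) in enumerate(branches):
        if os.path.exists(os.path.join(root, name)):
            return index
    return None


def append_data(branches, name, data):
    """Append data to name, copying it up first if it sits on a readonly branch."""
    index = find(branches, name)
    if index is None:
        raise FileNotFoundError(name)
    root, mode = branches[index]
    if mode != "rw":
        # Nearest writable branch above the one holding the file.
        upper = next(
            (i for i in range(index - 1, -1, -1) if branches[i][1] == "rw"), None
        )
        if upper is None:
            raise PermissionError("no writable branch above " + name)
        target = os.path.join(branches[upper][0], name)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(os.path.join(root, name), target)  # the "copy-up"
        index = upper
    with open(os.path.join(branches[index][0], name), "a") as f:
        f.write(data)  # the originally requested write
    return index


if __name__ == "__main__":
    top = tempfile.mkdtemp()
    rw, ro = os.path.join(top, "rw"), os.path.join(top, "ro")
    os.mkdir(rw)
    os.mkdir(ro)
    with open(os.path.join(ro, "f.txt"), "w") as f:
        f.write("old")
    print(append_data([(rw, "rw"), (ro, "ro")], "f.txt", "+new"))
    shutil.rmtree(top)
```

A single small append triggers a copy of the whole file, which is the cost the paragraph above refers to; the file on the readonly branch stays untouched.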
168 | Some people may think it is better to pass such work to a user space | |
169 | helper, instead of doing it in kernel space. Actually I am still thinking | |
170 | about it. But currently I have implemented it in kernel space. | |