# Copyright (C) 2005-2017 Junjiro R. Okajima
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

Basic Aufs Internal Structure

Superblock/Inode/Dentry/File Objects
----------------------------------------------------------------------
Like an ordinary filesystem, aufs has its own
superblock/inode/dentry/file objects. Each of these objects has a
dynamically allocated array which stores pointers of the same kind to
the corresponding objects on the lower filesystems (the branches).
For example, suppose you build a union with one readwrite branch and one
readonly branch, mounted at /au, /rw and /ro respectively.
- /au = /rw + /ro
- /ro/fileA exists but /rw/fileA does not

The aufs lookup operation finds /ro/fileA and gets a dentry for it. This
pointer is stored in the aufs dentry. The array in the aufs dentry will
be,
- [0] = NULL (because /rw/fileA doesn't exist)
- [1] = /ro/fileA

This style of array is essentially the same in all of the aufs
superblock/inode/dentry/file objects.
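
The per-branch array above can be modeled in a few lines of Python.
This is only an illustrative sketch, not aufs code: BRANCHES and
lookup() are hypothetical names, and a branch is reduced to the set of
names that exist on it.

```python
from typing import List, Optional

# Hypothetical model: index 0 is the upper writable branch,
# index 1 is the lower readonly branch.
BRANCHES = [
    {"name": "/rw", "files": set()},
    {"name": "/ro", "files": {"fileA"}},
]

def lookup(name: str) -> List[Optional[str]]:
    """Return one slot per branch: the lower path, or None if absent."""
    slots: List[Optional[str]] = []
    for br in BRANCHES:
        if name in br["files"]:
            slots.append(br["name"] + "/" + name)
        else:
            slots.append(None)
    return slots

print(lookup("fileA"))   # [None, '/ro/fileA']
```

The slot index always equals the branch index, which is why the same
layout can be reused for superblock, inode, dentry and file objects.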

Because aufs supports manipulating branches, i.e. adding/deleting/changing
branches dynamically, each of these objects has its own generation. When
branches are changed, the generation in the aufs superblock is
incremented. The generation in every other object is compared with it
when the object is accessed. When the generation in an object turns out
to be obsolete, aufs refreshes the internal array of that object.
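
The generation check can be sketched as follows. This is a simplified
model under stated assumptions: the class and method names are
hypothetical, and the "array" holds candidate paths instead of kernel
pointers.

```python
class Superblock:
    def __init__(self, branches):
        self.branches = list(branches)
        self.generation = 0

    def change_branches(self, branches):
        self.branches = list(branches)
        self.generation += 1      # every branch change bumps the generation


class Dentry:
    def __init__(self, sb, name):
        self.sb = sb
        self.name = name
        self.refresh_count = 0
        self._refresh()

    def _refresh(self):
        # Rebuild the per-branch array and record the generation it matches.
        self.slots = [b + "/" + self.name for b in self.sb.branches]
        self.generation = self.sb.generation
        self.refresh_count += 1

    def access(self):
        if self.generation != self.sb.generation:   # obsolete: refresh lazily
            self._refresh()
        return self.slots


sb = Superblock(["/rw", "/ro"])
d = Dentry(sb, "fileA")
sb.change_branches(["/rw", "/new", "/ro"])
print(d.access())   # ['/rw/fileA', '/new/fileA', '/ro/fileA']
```

The point of the design is that a branch change costs one increment;
each object pays for its refresh only when it is next accessed.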


Superblock
----------------------------------------------------------------------
Additionally, the aufs superblock holds data for the policies which
select one among multiple writable branches, the XIB files, the
pseudo-links and the kobject. See below for details.
For the policies which support copying down a directory, see
wbr_policy.txt too.


Branch and XINO (External Inode Number Translation Table)
----------------------------------------------------------------------
Every branch has its own xino (external inode number translation table)
file. The xino file is created and unlinked by aufs internally. When two
members of a union exist on the same filesystem, they share a single
xino file.
The structure of a xino file is simple: just a sequence of aufs inode
numbers, indexed by the lower inode number.
In the above sample, assume the inode number of /ro/fileA is i111 and
aufs assigns the inode number i999 to fileA. Then aufs writes 999 as a
4 (or 8) byte value at offset 111 * 4 (or 8) bytes in the xino file.
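
The offset arithmetic can be tried out with an ordinary file. This is a
minimal sketch assuming 8-byte inode numbers in native byte order;
xino_write() and xino_read() are hypothetical helper names, not aufs
functions.

```python
import os
import struct
import tempfile

INO_FMT = "=Q"                        # assumption: 8-byte inode numbers
INO_SIZE = struct.calcsize(INO_FMT)   # 8

def xino_write(f, lower_ino, aufs_ino):
    """Store aufs_ino in the slot indexed by lower_ino."""
    f.seek(lower_ino * INO_SIZE)
    f.write(struct.pack(INO_FMT, aufs_ino))

def xino_read(f, lower_ino):
    """Return the aufs inode number for lower_ino, or 0 if none is stored."""
    f.seek(lower_ino * INO_SIZE)
    data = f.read(INO_SIZE)
    if len(data) < INO_SIZE:
        return 0                      # beyond end of file
    return struct.unpack(INO_FMT, data)[0]   # a hole reads back as 0

with tempfile.TemporaryFile() as xino:
    xino_write(xino, 111, 999)
    print(xino_read(xino, 111))       # 999
    print(xino_read(xino, 5))         # 0
    xino.flush()
    print(os.fstat(xino.fileno()).st_size)   # 896, i.e. (111 + 1) * 8
```

Slots 0 to 110 were never written, so on a filesystem that supports
sparse files they form a hole and occupy no disk blocks.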

When the inode numbers are not contiguous, the xino file will be sparse:
it has holes in it and doesn't consume as much disk space as it might
appear to. If your branch filesystem consumes disk space for such
holes, then you should specify the 'xino=' option when mounting aufs.

Aufs has a mount option to free the disk blocks for such holes in XINO
files on tmpfs or ramdisk, but in practice it is not very effective. If
you run into a shortage of disk space due to XINO files, then you should
try "tmpfs-ino.patch" (and "vfs-ino.patch" too) in aufs4-standalone.git.
The patch localizes the assignment of inumbers per tmpfs mount and
avoids the holes in XINO files.

Also, a writable branch has three kinds of "whiteout bases". All of
these exist once the branch has joined aufs, and their names are
whiteout-ed doubly, so that users will never see their names in the aufs
hierarchy.
1. a regular file to which all whiteouts will be hardlinked.
2. a directory to store pseudo-links.
3. a directory to store an "orphan"-ed file temporarily.

1. Whiteout Base
   When you remove a file on a readonly branch, aufs handles it as a
   logical deletion and creates a whiteout on the upper writable branch
   as a hardlink of this file, in order not to consume an inode on the
   writable branch.
2. Pseudo-link Dir
   See Pseudo-link below.
3. Step-Parent Dir
   When "fileC" exists on the lower readonly branch only, and it is
   opened and then removed together with its parent dir, and then the
   user writes something into it, aufs copies fileC up to this
   directory, because there is no other dir to store fileC. After the
   file is created under this dir, it is unlinked.
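
The inode-saving effect of the whiteout base can be observed with plain
hardlinks. The sketch below is an illustration only: the ".wh." prefix
and the base file name are assumptions made for this example, and
add_branch() and whiteout() are hypothetical helpers.

```python
import os
import tempfile

WH_PREFIX = ".wh."     # assumption: whiteouts are marked by a name prefix
WH_BASE = ".wh.base"   # hypothetical name for the whiteout base file

def add_branch(rw_dir):
    """Create the whiteout base once, when the writable branch joins."""
    open(os.path.join(rw_dir, WH_BASE), "w").close()

def whiteout(rw_dir, name):
    """Logically delete name by hardlinking the base under a whiteout name."""
    wh = os.path.join(rw_dir, WH_PREFIX + name)
    os.link(os.path.join(rw_dir, WH_BASE), wh)
    return wh

with tempfile.TemporaryDirectory() as rw:
    add_branch(rw)
    wh = whiteout(rw, "fileA")
    base = os.path.join(rw, WH_BASE)
    print(os.stat(wh).st_ino == os.stat(base).st_ino)   # True: same inode
```

Every further whiteout only adds a directory entry and raises the link
count of the one base inode.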

Because aufs supports manipulating branches, i.e. adding/deleting/changing
them dynamically, each branch has its own id. When the branch order
changes, aufs finds the new index by searching for the branch id.
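
A minimal sketch of the id-to-index search, assuming ids are handed out
from a simple counter and never reused. Branch and find_index() are
hypothetical names for this illustration.

```python
from itertools import count

_next_id = count(1)

class Branch:
    def __init__(self, path):
        self.path = path
        self.id = next(_next_id)   # stays the same when the order changes

def find_index(branches, branch_id):
    """Return the current index of the branch with this id, or -1."""
    for index, br in enumerate(branches):
        if br.id == branch_id:
            return index
    return -1                      # the branch has been deleted

rw, ro = Branch("/rw"), Branch("/ro")
branches = [rw, ro]
branches.insert(1, Branch("/new"))     # /ro moves from index 1 to 2
print(find_index(branches, ro.id))     # 2
```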


Pseudo-link
----------------------------------------------------------------------
Assume "fileA" exists on the lower readonly branch only and it is
hardlinked to "fileB" on that branch. When you write something to fileA,
aufs copies it up to the upper writable branch. Additionally, aufs
creates a hardlink under the Pseudo-link Directory of the writable
branch. The inodes of pseudo-links are kept in the aufs super_block as a
simple list. If fileB is read after fileA is unlinked, aufs returns
the file data from the pseudo-link instead of from the lower readonly
branch. Because the pseudo-link is based upon the inode, keeping the
inode number stable by xino (see above) is essential.

All the hardlinks under the Pseudo-link Directory of the writable branch
should be restored to their proper locations later. Aufs provides a
utility to do this. The userspace helpers execute it at remounting and
unmounting aufs by default.
While this utility is running, it puts aufs into the pseudo-link
maintenance mode. In this mode, only the process which began the
maintenance mode (and its child processes) is allowed to operate in
aufs. Some other processes which are not related to the pseudo-link will
be allowed to run too, but the rest have to return an error or wait
until the maintenance mode ends. If a process has already acquired an
inode mutex (in VFS), it has to return an error.


XIB (external inode number bitmap)
----------------------------------------------------------------------
In addition to the xino file per branch, aufs has an external inode
number bitmap in the superblock object. It is also an internal file,
like a xino file.
It is a simple bitmap which marks whether an aufs inode number is in use
or not.
To reduce file I/O, aufs prepares a single memory page to cache the xib.

As with XINO files, aufs has a feature to truncate/refresh the XIB to
reduce the number of disk blocks consumed by these files.
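
The bitmap itself can be sketched as below. This in-memory model is an
assumption-laden illustration: InoBitmap and its methods are
hypothetical, and the backing file and the one-page cache are left out.

```python
class InoBitmap:
    """One bit per aufs inode number: 1 = in use, 0 = free."""

    def __init__(self):
        self.bits = bytearray()

    def in_use(self, ino):
        byte = ino // 8
        return byte < len(self.bits) and bool(self.bits[byte] & (1 << (ino % 8)))

    def set(self, ino):
        need = ino // 8 + 1
        if len(self.bits) < need:                  # grow with zeroed bytes
            self.bits.extend(bytes(need - len(self.bits)))
        self.bits[ino // 8] |= 1 << (ino % 8)

    def clear(self, ino):
        if self.in_use(ino):
            self.bits[ino // 8] &= ~(1 << (ino % 8)) & 0xFF

    def alloc(self, start=1):
        """Mark and return the lowest free inode number >= start."""
        ino = start
        while self.in_use(ino):
            ino += 1
        self.set(ino)
        return ino

xib = InoBitmap()
print(xib.alloc(), xib.alloc())   # 1 2
xib.clear(1)
print(xib.alloc())                # 1
```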


Virtual or Vertical Dir, and Readdir in Userspace
----------------------------------------------------------------------
In order to support multiple layers (branches), the aufs readdir
operation constructs a virtual dir block in memory. For readdir, aufs
calls vfs_readdir() internally for each dir on the branches, merges
their entries while eliminating the whiteout-ed ones, and sets the
result to the file (dir) object. So the file object keeps its entry list
until it is closed. The entry list will be updated when the file
position is zero and the list has become obsolete. This decision is made
by aufs automatically.

The dynamically allocated memory block for the names of entries has a
unit of 512 bytes (by default) and stores the names contiguously (no
padding). Another block for each entry is handled by kmem_cache too.
While building the dir blocks, aufs creates a hash list and judges
whether each entry is whiteout-ed by its upper branch or already listed.
The merged result is cached in the corresponding inode object and
maintained by a customizable life-time option.
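
The merge step can be sketched as follows. It is a simplified model, not
the kernel implementation: it assumes a whiteout is an entry named
".wh.<name>", takes already-read listings instead of calling
vfs_readdir(), and merged_readdir() is a hypothetical name.

```python
WH_PREFIX = ".wh."   # assumption: a whiteout is an entry named ".wh.<name>"

def merged_readdir(branch_listings):
    """branch_listings: entry names per branch, uppermost branch first."""
    seen = set()         # names already listed (plays the hash list role)
    whiteouted = set()   # names hidden by a whiteout on an upper branch
    result = []
    for entries in branch_listings:
        new_whiteouts = set()
        for name in entries:
            if name.startswith(WH_PREFIX):
                new_whiteouts.add(name[len(WH_PREFIX):])
            elif name not in seen and name not in whiteouted:
                seen.add(name)
                result.append(name)
        # A whiteout hides entries on the branches below it only.
        whiteouted |= new_whiteouts
    return result

rw = [".wh.fileB", "fileA", "fileC"]
ro = ["fileA", "fileB", "fileD"]
print(merged_readdir([rw, ro]))   # ['fileA', 'fileC', 'fileD']
```

fileA is listed once although it exists on both branches, and fileB is
dropped because the upper branch holds a whiteout for it.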

Some people may say this can be a security hole or can invite a DoS
attack, since an opened and once readdir-ed dir (file object) holds its
entry list and puts pressure on system memory. But I'd say it is similar
to files under /proc or /sys. The virtual files in them also hold a
memory page (generally) while they are opened. When an idea to reduce
memory for them is introduced, it will be applied to aufs too.
For those who really hate this situation, I've developed a readdir(3)
library which performs this merging in userspace. You just need to set
the LD_PRELOAD environment variable, and aufs will not consume any
memory in kernel space for readdir(3).


Workqueue
----------------------------------------------------------------------
Aufs sometimes requires privileged access to a branch, for instance in a
copy-up/down operation. Suppose a user process is going to make changes
to a file which exists on the lower readonly branch only, and the mode
of one of its ancestor directories is not writable by the user
process. Here aufs copies up the file together with its ancestors, and
this may require privilege to set their owner/group/mode/etc.
This is a typical case of the application character of aufs (see
Introduction).

Aufs uses a workqueue synchronously for this case. It creates its own
workqueue. The workqueue is a kernel thread and has privilege. Aufs
passes it the request to call mkdir or write (for example), and waits
for its completion. This approach simply avoids the problem of signal
handling.
If aufs didn't adopt the workqueue and instead changed the privilege of
the process, then the process might receive an unexpected SIGXFSZ or
other signals.
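
The "queue the request, then wait" pattern can be sketched in userspace
with a one-thread executor. This is an analogy under stated
assumptions: a Python thread has no extra privilege, and wkq,
wkq_wait() and privileged_mkdir() are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Stand-in for the dedicated aufs workqueue: a single worker thread.
wkq = ThreadPoolExecutor(max_workers=1, thread_name_prefix="aufs-wkq")

def privileged_mkdir(path, log):
    # In aufs the real work would run here, in the worker's context.
    log.append((path, threading.current_thread().name))
    return 0

def wkq_wait(func, *args):
    """Queue func on the worker and block until it has completed."""
    return wkq.submit(func, *args).result()

log = []
err = wkq_wait(privileged_mkdir, "/rw/dir", log)
print(err, log[0][0])   # 0 /rw/dir
```

The caller never changes its own credentials; it only sleeps until the
worker has finished, and any exception raised in the worker is re-raised
by result() in the caller.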

Aufs also uses the system global workqueue ("events" kernel thread) for
asynchronous tasks, such as handling inotify/fsnotify, re-creating a
whiteout base, etc. This is unrelated to privilege.
Most aufs operations try to acquire a rw_semaphore for the aufs
superblock at the beginning, and at the same time wait for the
completion of all queued asynchronous tasks.


Whiteout
----------------------------------------------------------------------
The whiteout in aufs is very similar to Unionfs's. It is represented
by its filename. UnionMount takes an approach based on a file mode, but
I am afraid several utilities (find(1) or something) would have to
support it.

Basically the whiteout represents "logical deletion", which stops aufs
from looking up further, but it also represents "dir is opaque", which
also stops further lookup.

In aufs, rmdir(2) and rename(2) for a dir use the whiteout in another
way. In order to make the several operations within a single systemcall
revertible, aufs adopts the approach of renaming a directory to a
temporary unique whiteout-ed name.
For example, in rename(2) of a dir where the target dir already exists,
aufs renames the target dir to a temporary unique whiteout-ed name
before the actual rename on a branch, and then handles the other actions
(making it opaque, updating the attributes, etc). If an error happens in
these actions, aufs simply renames the whiteout-ed name back and returns
an error. If all of them succeed, aufs registers a function with the
system global workqueue to remove the whiteout-ed unique temporary name
completely and asynchronously.
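
The revertible rename described above can be sketched on an ordinary
filesystem. This is an illustration only: replace_dir() is a
hypothetical helper, the temporary name format is an assumption, and the
final removal is done by the caller instead of a workqueue.

```python
import os
import shutil
import tempfile
import uuid

def replace_dir(src, dst, post_actions):
    """Rename dir src over the existing dir dst; undo everything on failure."""
    parent = os.path.dirname(dst)
    # Hypothetical format for the temporary unique whiteout-ed name.
    tmp = os.path.join(parent, ".wh..tmp." + uuid.uuid4().hex)
    os.rename(dst, tmp)              # 1. move the target out of the way
    try:
        os.rename(src, dst)          # 2. the actual rename
        try:
            post_actions(dst)        # 3. make it opaque, update attributes, ...
        except Exception:
            os.rename(dst, src)      # undo step 2
            raise
    except Exception:
        os.rename(tmp, dst)          # undo step 1
        raise
    return tmp                       # to be removed later

with tempfile.TemporaryDirectory() as top:
    src, dst = os.path.join(top, "src"), os.path.join(top, "dst")
    os.mkdir(src)
    os.mkdir(dst)
    old = replace_dir(src, dst, lambda path: None)
    shutil.rmtree(old)               # aufs does this step asynchronously
    print(os.listdir(top))           # ['dst']
```

Because the old target is only renamed, never deleted, until every step
has succeeded, reverting needs nothing more than renaming it back.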


Copy-up
----------------------------------------------------------------------
It is a well-known feature or concept.
When a user modifies a file on a readonly branch, aufs performs a
"copy-up" internally and makes the change to the new file on the upper
writable branch.
When the triggering systemcall does not update the timestamps of the
parent dir, aufs reverts them after the copy-up.
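
A userspace sketch of copy-up with the timestamp revert, assuming the
parent dir already exists on the writable branch. copy_up() is a
hypothetical helper, and shutil.copy2() stands in for the internal copy.

```python
import os
import shutil
import tempfile

def copy_up(name, ro_dir, rw_dir):
    """Copy ro_dir/name to rw_dir/name, leaving rw_dir's timestamps as they were."""
    before = os.stat(rw_dir)
    dst = os.path.join(rw_dir, name)
    shutil.copy2(os.path.join(ro_dir, name), dst)   # data plus file metadata
    # Creating dst updated the parent's mtime; put the old times back.
    os.utime(rw_dir, ns=(before.st_atime_ns, before.st_mtime_ns))
    return dst

with tempfile.TemporaryDirectory() as ro, tempfile.TemporaryDirectory() as rw:
    with open(os.path.join(ro, "fileA"), "w") as f:
        f.write("lower data")
    os.utime(rw, ns=(10**18, 10**18))               # a recognizable old time
    copy_up("fileA", ro, rw)
    print(os.stat(rw).st_mtime_ns == 10**18)        # True
```

Without the revert, a write to an existing file would appear to modify
its parent dir, which the triggering systemcall itself never does.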


Move-down (aufs3.9 and later)
----------------------------------------------------------------------
"Copy-up" is one of the essential features of aufs. It copies a file
from the lower readonly branch to the upper writable branch when a user
changes something about the file.
"Move-down" is the opposite action of copy-up. Basically this action is
run manually instead of automatically and internally.
For design and implementation, aufs has to consider these issues.
- a whiteout for the file may exist on the lower branch.
- ancestor directories may not exist on the lower branch.
- diropq for the ancestor directories may exist on the upper branch.
- free space on the lower branch will be reduced.
- another access to the file may happen during moving-down, including
  UDBA (see "Revalidate Dentry and UDBA").
- the file should not be hard-linked nor pseudo-linked. Such files
  should be handled by the auplink utility later.

Sometimes users want to move down a file from the upper writable branch
to the lower readonly or writable branch. For instance,
- the free space of the upper writable branch is going to run out.
- a new intermediate branch is created between the upper and lower
  branches.
- etc.

For this purpose, use the "aumvdown" command in aufs-util.git.