.. SPDX-License-Identifier: GPL-2.0


The Second Extended Filesystem
==============================

ext2 was originally released in January 1993. Written by Rémy Card,
Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
Extended Filesystem. It is currently still (April 2001) the predominant
filesystem in use by Linux. There are also implementations available
for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.

Options
=======

Most defaults are determined by the filesystem superblock, and can be
set using tune2fs(8). Kernel-determined defaults are indicated by (*).

====================  ===  ================================================
bsddf                 (*)  Makes ``df`` act like BSD.
minixdf                    Makes ``df`` act like Minix.

check=none, nocheck   (*)  Don't do extra checking of bitmaps on mount
                           (check=normal and check=strict options removed)

dax                        Use direct access (no page cache). See
                           Documentation/filesystems/dax.txt.

debug                      Extra debugging information is sent to the
                           kernel syslog. Useful for developers.

errors=continue            Keep going on a filesystem error.
errors=remount-ro          Remount the filesystem read-only on an error.
errors=panic               Panic and halt the machine if an error occurs.

grpid, bsdgroups           Give objects the same group ID as their parent.
nogrpid, sysvgroups        New objects have the group ID of their creator.

nouid32                    Use 16-bit UIDs and GIDs.

oldalloc                   Enable the old block allocator. Orlov should
                           have better performance; we'd like to get some
                           feedback if it's the contrary for you.
orlov                 (*)  Use the Orlov block allocator.
                           (See http://lwn.net/Articles/14633/ and
                           http://lwn.net/Articles/14446/.)

resuid=n                   The user ID which may use the reserved blocks.
resgid=n                   The group ID which may use the reserved blocks.

sb=n                       Use alternate superblock at this location.

user_xattr                 Enable "user." POSIX Extended Attributes
                           (requires CONFIG_EXT2_FS_XATTR).
nouser_xattr               Don't support "user." extended attributes.

acl                        Enable POSIX Access Control Lists support
                           (requires CONFIG_EXT2_FS_POSIX_ACL).
noacl                      Don't support POSIX ACLs.

nobh                       Do not attach buffer_heads to file pagecache.

quota, usrquota            Enable user disk quota support
                           (requires CONFIG_QUOTA).

grpquota                   Enable group disk quota support
                           (requires CONFIG_QUOTA).
====================  ===  ================================================

The noquota option is silently ignored by ext2.


Specification
=============

ext2 shares many properties with traditional Unix filesystems. It has
the concepts of blocks, inodes and directories. It has space in the
specification for Access Control Lists (ACLs), fragments, undeletion and
compression, though these are not yet implemented (some are available as
separate patches). There is also a versioning mechanism to allow new
features (such as journalling) to be added in a maximally compatible
manner.

Blocks
------

The space in the device or file is split up into blocks. These are
a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
which is decided when the filesystem is created. Smaller blocks mean
less wasted space per file, but require slightly more accounting overhead,
and also impose other limits on the size of files and the filesystem.

Block Groups
------------

Blocks are clustered into block groups in order to reduce fragmentation
and minimise the amount of head seeking when reading a large amount
of consecutive data. Information about each block group is kept in a
descriptor table stored in the block(s) immediately after the superblock.
Two blocks near the start of each group are reserved for the block usage
bitmap and the inode usage bitmap which show which blocks and inodes
are in use. Since each bitmap is limited to a single block, and each bit
tracks one block, a block group can contain at most 8 times as many
blocks as there are bytes in a block.
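As a rough illustration of that limit, the sketch below computes how many blocks a single bitmap block can track and the resulting group size in bytes. The function names are illustrative only and do not come from the kernel sources.

```python
def max_blocks_per_group(block_size: int) -> int:
    """A bitmap block holds block_size bytes, i.e. 8 * block_size bits,
    and each bit tracks one block of the group."""
    return 8 * block_size


def max_group_bytes(block_size: int) -> int:
    """Largest possible block group, in bytes, for a given block size."""
    return max_blocks_per_group(block_size) * block_size


for size in (1024, 2048, 4096):
    mib = max_group_bytes(size) // 2**20
    print(f"{size}-byte blocks: {max_blocks_per_group(size)} blocks, {mib} MiB per group")
```

With 1024-byte blocks a group therefore tops out at 8192 blocks (8 MiB), and with 4096-byte blocks at 32768 blocks (128 MiB).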

The block(s) following the bitmaps in each block group are designated
as the inode table for that block group and the remainder are the data
blocks. The block allocation algorithm attempts to allocate data blocks
in the same block group as the inode which contains them.

The Superblock
--------------

The superblock contains all the information about the configuration of
the filesystem. The primary copy of the superblock is stored at an
offset of 1024 bytes from the start of the device, and it is essential
to mounting the filesystem. Since it is so important, backup copies of
the superblock are stored in block groups throughout the filesystem.
The first version of ext2 (revision 0) stores a copy at the start of
every block group, along with backups of the group descriptor block(s).
Because this can consume a considerable amount of space for large
filesystems, later revisions can optionally reduce the number of backup
copies by only putting backups in specific groups (this is the sparse
superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.
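The group selection rule can be sketched as follows. This is a minimal illustration of the rule as stated above, not the kernel's or e2fsprogs' actual code, and the function names are hypothetical.

```python
def is_power_of(n: int, base: int) -> bool:
    """True if n == base**k for some integer k >= 1."""
    if n < base:
        return False
    while n % base == 0:
        n //= base
    return n == 1


def group_has_superblock_copy(group: int) -> bool:
    """Apply the sparse superblock rule: groups 0 and 1, plus any group
    whose number is a power of 3, 5 or 7."""
    if group in (0, 1):
        return True
    return any(is_power_of(group, base) for base in (3, 5, 7))


print([g for g in range(50) if group_has_superblock_copy(g)])
# prints [0, 1, 3, 5, 7, 9, 25, 27, 49]
```

The spacing between copies grows geometrically, so the space spent on backups stays small even on filesystems with many block groups.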

The information in the superblock contains fields such as the total
number of inodes and blocks in the filesystem and how many are free,
how many inodes and blocks are in each block group, when the filesystem
was mounted (and if it was cleanly unmounted), when it was modified,
what version of the filesystem it is (see the Revisions section below)
and which OS created it.

If the filesystem is revision 1 or higher, then there are extra fields,
such as a volume name, a unique identification number, the inode size,
and space for optional filesystem features to store configuration info.

All fields in the superblock (as in all other ext2 structures) are stored
on the disk in little endian format, so a filesystem is portable between
machines without having to know what machine it was created on.
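To make the layout concrete, here is a small sketch that decodes a few superblock counters from a filesystem image. The 1024-byte offset and the little endian encoding come from the text above; the field order, the offset of the magic number and its value (0xEF53) are assumptions based on the kernel's ``struct ext2_super_block`` and should be checked against the ext2 headers before being relied on. The function names are hypothetical.

```python
import struct

SUPERBLOCK_OFFSET = 1024      # byte offset of the primary copy (from the text)
EXT2_MAGIC = 0xEF53           # assumed value of the s_magic field
MAGIC_OFFSET = 56             # assumed offset of the 16-bit s_magic field
HEAD = struct.Struct("<7I")   # assumed: first seven 32-bit fields, little endian


def parse_superblock(image: bytes) -> dict:
    """Decode a few counters from the primary superblock of an image."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    (inodes, blocks, _reserved, free_blocks, free_inodes,
     _first_data_block, log_block_size) = HEAD.unpack_from(sb, 0)
    (magic,) = struct.unpack_from("<H", sb, MAGIC_OFFSET)
    if magic != EXT2_MAGIC:
        raise ValueError("magic number mismatch: not an ext2 superblock")
    return {
        "inodes": inodes,
        "blocks": blocks,
        "free_blocks": free_blocks,
        "free_inodes": free_inodes,
        # assumed encoding: block size is stored as a shift of 1024
        "block_size": 1024 << log_block_size,
    }


def make_test_image() -> bytes:
    """Build a synthetic image with only the fields read above filled in."""
    image = bytearray(4096)
    HEAD.pack_into(image, SUPERBLOCK_OFFSET, 128, 1024, 51, 900, 117, 1, 0)
    struct.pack_into("<H", image, SUPERBLOCK_OFFSET + MAGIC_OFFSET, EXT2_MAGIC)
    return bytes(image)


print(parse_superblock(make_test_image()))
```

Because the ``<`` prefix forces little endian decoding, the same code gives the same result on big endian and little endian hosts, which is the portability property described above.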

Inodes
------

The inode (index node) is a fundamental concept in the ext2 filesystem.
Each object in the filesystem is represented by an inode. The inode
structure contains pointers to the filesystem blocks which contain the
data held in the object and all of the metadata about an object except
its name. The metadata about an object includes the permissions, owner,
group, flags, size, number of blocks used, access time, change time,
modification time, deletion time, number of links, fragments, version
(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).

There are some reserved fields which are currently unused in the inode
structure and several which are overloaded. One field is reserved for the
directory ACL if the inode is a directory and alternately for the top 32
bits of the file size if the inode is a regular file (allowing file sizes
larger than 2GB). The translator field is unused under Linux, but is used
by the HURD to reference the inode of a program which will be used to
interpret this object. Most of the remaining reserved fields have been
used up for both Linux and the HURD for larger owner and group fields.
The HURD also has a larger mode field so it uses another of the remaining
fields to store the extra mode bits.

There are pointers to the first 12 blocks which contain the file's data
in the inode. There is a pointer to an indirect block (which contains
pointers to the next set of blocks), a pointer to a doubly-indirect
block (which contains pointers to indirect blocks) and a pointer to a
trebly-indirect block (which contains pointers to doubly-indirect blocks).
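The pointer scheme can be illustrated with a short sketch that reports how many levels of indirection are needed to reach a given logical block of a file. It assumes 32-bit (4-byte) block numbers, so that a block of N bytes holds N/4 pointers; the names are illustrative and not taken from the kernel.

```python
DIRECT_POINTERS = 12   # direct block pointers held in the inode (from the text)
POINTER_BYTES = 4      # assumption: block numbers are 32 bits wide


def pointers_per_block(block_size: int) -> int:
    return block_size // POINTER_BYTES


def indirection_level(index: int, block_size: int) -> int:
    """Return 0 for a direct block, 1/2/3 for singly-, doubly- or
    trebly-indirect, given the logical block index within the file."""
    p = pointers_per_block(block_size)
    if index < DIRECT_POINTERS:
        return 0
    index -= DIRECT_POINTERS
    for level in (1, 2, 3):
        if index < p ** level:
            return level
        index -= p ** level
    raise ValueError("block index beyond the trebly-indirect range")


def max_mappable_blocks(block_size: int) -> int:
    """Total number of data blocks reachable through all four pointer kinds."""
    p = pointers_per_block(block_size)
    return DIRECT_POINTERS + p + p ** 2 + p ** 3


print(indirection_level(11, 1024), indirection_level(12, 1024))
print(max_mappable_blocks(1024) * 1024 / 2**30, "GiB")
```

Under these assumptions, 1 kB blocks allow 16843020 data blocks, or roughly 16 GiB, which is in line with the file size limit quoted for that block size in the Limitations section. For larger block sizes other limits apply before the pointer tree is exhausted.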

The flags field contains some ext2-specific flags which aren't catered
for by the standard chmod flags. These flags can be listed with lsattr
and changed with the chattr command, and allow specific filesystem
behaviour on a per-file basis. There are flags for secure deletion,
undeletable, compression, synchronous updates, immutability, append-only,
dumpable, no-atime, indexed directories, and data-journaling. Not all
of these are supported yet.

Directories
-----------

A directory is a filesystem object and has an inode just like a file.
It is a specially formatted file containing records which associate
each name with an inode number. Later revisions of the filesystem also
encode the type of the object (file, directory, symlink, device, fifo,
socket) to avoid the need to check the inode itself for this information
(support for taking advantage of this feature does not yet exist in
Glibc 2.2).

The inode allocation code tries to assign inodes which are in the same
block group as the directory in which they are first created.

The current implementation of ext2 uses a singly-linked list to store
the filenames in the directory; a pending enhancement uses hashing of the
filenames to allow lookup without the need to scan the entire directory.
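The linked-list structure can be sketched as a walk over variable-length records, where each record's length field gives the offset of the next one. The record layout used here (32-bit inode number, 16-bit record length, 8-bit name length, 8-bit file type, then the name) and the convention that inode 0 marks an unused record are assumptions modelled on the kernel's ``struct ext2_dir_entry_2``; the helper names and the file type codes in the demo are hypothetical.

```python
import struct

# Assumed record header: inode, rec_len, name_len, file_type (little endian).
HEADER = struct.Struct("<IHBB")


def pack_entry(inode: int, name: bytes, file_type: int, rec_len: int = 0) -> bytes:
    """Build one directory record, padded to rec_len (or to 4 bytes)."""
    minimum = (HEADER.size + len(name) + 3) & ~3
    rec_len = rec_len or minimum
    raw = HEADER.pack(inode, rec_len, len(name), file_type) + name
    return raw.ljust(rec_len, b"\0")


def walk_directory_block(block: bytes):
    """Yield (inode, name, file_type) by following rec_len from record
    to record, which is the 'linked list' the text refers to."""
    offset = 0
    while offset < len(block):
        inode, rec_len, name_len, file_type = HEADER.unpack_from(block, offset)
        if rec_len < HEADER.size:
            raise ValueError("corrupt directory record")
        if inode != 0:  # assumed: inode 0 marks an unused record
            start = offset + HEADER.size
            yield inode, block[start:start + name_len], file_type
        offset += rec_len


# Demo: a 1024-byte block whose last record is stretched to fill the block.
block = pack_entry(2, b".", 2) + pack_entry(2, b"..", 2)
block += pack_entry(12, b"hello.txt", 1, rec_len=1024 - len(block))
for entry in walk_directory_block(block):
    print(entry)
```

Finding a name means visiting records one after another, which is why lookups in very large directories become slow without a hashed index.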

The current implementation never removes empty directory blocks once they
have been allocated to hold more files.

Special files
-------------

Symbolic links are also filesystem objects with inodes. They deserve
special mention because the data for them is stored within the inode
itself if the symlink is less than 60 bytes long. The symlink uses the
fields which would normally be used to store the pointers to data blocks.
This is a worthwhile optimisation as it avoids allocating a full
block for the symlink, and most symlinks are less than 60 characters long.
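The 60-byte figure matches the space taken by the block pointers described in the Inodes section: 12 direct pointers plus three indirect ones, assuming each pointer is 4 bytes wide. A minimal sketch of the check, with hypothetical names:

```python
POINTER_SLOTS = 12 + 3   # direct + indirect, doubly- and trebly-indirect
POINTER_BYTES = 4        # assumption: block numbers are 32 bits wide
INLINE_CAPACITY = POINTER_SLOTS * POINTER_BYTES


def stored_in_inode(target: str) -> bool:
    """True if a symlink target is short enough to live inside the inode
    (the text gives the condition as 'less than 60 bytes')."""
    return len(target.encode()) < INLINE_CAPACITY


print(INLINE_CAPACITY)
print(stored_in_inode("../lib/libc.so.6"), stored_in_inode("x" * 60))
```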

Character and block special devices never have data blocks assigned to
them. Instead, their device number is stored in the inode, again reusing
the fields which would be used to point to the data blocks.

Reserved Space
--------------

In ext2, there is a mechanism for reserving a certain number of blocks
for a particular user (normally the super-user). This is intended to
allow for the system to continue functioning even if non-privileged users
fill up all the space available to them (this is independent of filesystem
quotas). It also keeps the filesystem from filling up entirely which
helps combat fragmentation.

Filesystem check
----------------

At boot time, most systems run a consistency check (e2fsck) on their
filesystems. The superblock of the ext2 filesystem contains several
fields which indicate whether fsck should actually run (since checking
the filesystem at boot can take a long time if it is large). fsck will
run if the filesystem was not cleanly unmounted, if the maximum mount
count has been exceeded or if the maximum time between checks has been
exceeded.
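The three conditions can be written down as a small decision function. This is only a sketch of the rule as described above: the field names are hypothetical stand-ins for the superblock fields, and treating non-positive limits as "disabled" is an assumption rather than something the text states.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    """Hypothetical stand-ins for the superblock fields the text alludes to."""
    cleanly_unmounted: bool
    mount_count: int
    max_mount_count: int   # assumed: values <= 0 disable this condition
    last_check: int        # seconds since the epoch
    check_interval: int    # seconds; assumed: 0 disables this condition


def reasons_to_check(state: CheckState, now: int) -> list:
    """Return the reasons (possibly none) for running a full check."""
    reasons = []
    if not state.cleanly_unmounted:
        reasons.append("not cleanly unmounted")
    if state.max_mount_count > 0 and state.mount_count > state.max_mount_count:
        reasons.append("maximum mount count exceeded")
    if state.check_interval > 0 and now - state.last_check > state.check_interval:
        reasons.append("maximum time between checks exceeded")
    return reasons


state = CheckState(True, mount_count=31, max_mount_count=30,
                   last_check=0, check_interval=0)
print(reasons_to_check(state, now=1_000_000))
```

An empty list means the check can be skipped at boot.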

Feature Compatibility
---------------------

The compatibility feature mechanism used in ext2 is sophisticated.
It safely allows features to be added to the filesystem, without
unnecessarily sacrificing compatibility with older versions of the
filesystem code. The feature compatibility mechanism is not supported by
the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
revision 1. There are three 32-bit fields, one for compatible features
(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
incompatible (INCOMPAT) features.

These feature flags have specific meanings for the kernel as follows:

A COMPAT flag indicates that a feature is present in the filesystem,
but the on-disk format is 100% compatible with older on-disk formats, so
a kernel which didn't know anything about this feature could read/write
the filesystem without any chance of corrupting the filesystem (or even
making it inconsistent). This is essentially just a flag which says
"this filesystem has a (hidden) feature" that the kernel or e2fsck may
want to be aware of (more on e2fsck and feature flags later). The ext3
HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
a regular file with data blocks in it so the kernel does not need to
take any special notice of it if it doesn't understand ext3 journaling.

An RO_COMPAT flag indicates that the on-disk format is 100% compatible
with older on-disk formats for reading (i.e. the feature does not change
the visible on-disk format). However, an old kernel writing to such a
filesystem would/could corrupt the filesystem, so this is prevented. The
most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
sparse groups allow file data blocks where superblock/group descriptor
backups used to live, and ext2_free_blocks() refuses to free these blocks,
which would lead to inconsistent bitmaps. An old kernel would also
get an error if it tried to free a series of blocks which crossed a group
boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.

An INCOMPAT flag indicates the on-disk format has changed in some
way that makes it unreadable by older kernels, or would otherwise
cause a problem if an old kernel tried to mount it. FILETYPE is an
INCOMPAT flag because older kernels would think a filename was longer
than 256 characters, which would lead to corrupt directory listings.
The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
doesn't understand compression, you would just get garbage back from
read() instead of it automatically decompressing your data. The ext3
RECOVER flag is needed to prevent a kernel which does not understand the
ext3 journal from mounting the filesystem without replaying the journal.

e2fsck needs to be stricter in its handling of these flags than the
kernel. If it doesn't understand ANY of the COMPAT,
RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem,
because it has no way of verifying whether a given feature is valid
or not. Allowing e2fsck to succeed on a filesystem with an unknown
feature would give the user a false sense of security. Refusing to check
a filesystem with unknown features is a good incentive for the user to
update to the latest e2fsck. This also means that anyone adding feature
flags to ext2 also needs to update e2fsck to verify these features.
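The rules above boil down to a few bitmask tests. The sketch below models them with made-up bit values; the constants and function names are hypothetical and are not the kernel's definitions.

```python
# Hypothetical bit assignments, for illustration only.
FEATURE_A = 0x1
FEATURE_B = 0x2


def kernel_mount_mode(ro_compat: int, incompat: int,
                      known_ro_compat: int, known_incompat: int) -> str:
    """Decide how a kernel may mount a filesystem, given the feature
    bits on disk and the bits this kernel understands.  Unknown COMPAT
    bits are harmless to the kernel, so they are not consulted."""
    if incompat & ~known_incompat:
        return "refuse"        # format unreadable or unsafe to mount
    if ro_compat & ~known_ro_compat:
        return "read-only"     # safe to read, unsafe to write
    return "read-write"


def e2fsck_will_check(compat: int, ro_compat: int, incompat: int,
                      known_compat: int, known_ro_compat: int,
                      known_incompat: int) -> bool:
    """e2fsck is stricter: any unknown bit in any field is a refusal."""
    unknown = ((compat & ~known_compat)
               | (ro_compat & ~known_ro_compat)
               | (incompat & ~known_incompat))
    return unknown == 0


# A kernel that knows FEATURE_A everywhere, on a filesystem that also
# sets FEATURE_B in its RO_COMPAT field:
print(kernel_mount_mode(FEATURE_A | FEATURE_B, 0, FEATURE_A, FEATURE_A))
print(e2fsck_will_check(FEATURE_B, 0, 0, FEATURE_A, FEATURE_A, FEATURE_A))
```

The asymmetry is deliberate: the kernel can safely ignore an unknown COMPAT bit, while a checker that ignored it could not vouch for the filesystem.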

Metadata
--------

It is frequently claimed that the ext2 implementation of writing
asynchronous metadata is faster than the ffs synchronous metadata
scheme but less reliable. Both methods are equally resolvable by their
respective fsck programs.

If you're exceptionally paranoid, there are 3 ways of making metadata
writes synchronous on ext2:

- per-file if you have the program source: use the O_SYNC flag to open()
- per-file if you don't have the source: use "chattr +S" on the file
- per-filesystem: add the "sync" option to mount (or in /etc/fstab)

The first and last are not ext2-specific but do force the metadata to
be written synchronously. See also Journaling below.
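For the first of these methods, a minimal sketch of opening a file with O_SYNC is shown below, using Python's ``os`` module as a thin wrapper around open(2). The helper name and the file name in the demo are hypothetical; ``os.O_SYNC`` is available on Linux and other Unix systems.

```python
import os
import tempfile


def write_synchronously(path: str, data: bytes) -> None:
    """Write data through a descriptor opened with O_SYNC, so each
    write() returns only after the data has been pushed to the device."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        view = memoryview(data)
        while view:                      # handle short writes
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "important.dat")
    write_synchronously(target, b"must survive a crash\n")
    with open(target, "rb") as handle:
        print(handle.read())
```

The trade-off is throughput: every write waits for the device, so this is best reserved for the few files that need it.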

Limitations
-----------

There are various limits imposed by the on-disk layout of ext2. Other
limits are imposed by the current implementation of the kernel code.
Many of the limits are determined at the time the filesystem is first
created, and depend upon the block size chosen. The ratio of inodes to
data blocks is fixed at filesystem creation time, so the only way to
increase the number of inodes is to increase the size of the filesystem.
No tools currently exist which can change the ratio of inodes to blocks.

Most of these limits could be overcome with slight changes in the on-disk
format and using a compatibility flag to signal the format change (at
the expense of some compatibility).

=====================  =======  =======  =======  ========
Filesystem block size  1kB      2kB      4kB      8kB
=====================  =======  =======  =======  ========
File size limit        16GB     256GB    2048GB   2048GB
Filesystem size limit  2047GB   8192GB   16384GB  32768GB
=====================  =======  =======  =======  ========

There is a 2.4 kernel limit of 2048GB for a single block device, so no
filesystem larger than that can be created at this time. There is also
an upper limit on the block size imposed by the page size of the kernel,
so 8kB blocks are only allowed on Alpha systems (and other architectures
which support larger pages).

There is an upper limit of 32000 subdirectories in a single directory.

There is a "soft" upper limit of about 10-15k files in a single directory
with the current linear linked-list directory implementation. This limit
stems from performance problems when creating and deleting (and also
finding) files in such large directories. Using a hashed directory index
(under development) allows 100k-1M+ files in a single directory without
performance problems (although RAM size becomes an issue at this point).

The (meaningless) absolute upper limit of files in a single directory
(imposed by the file size; the realistic limit is obviously much less)
is over 130 trillion files. It would be higher except there are not
enough 4-character names to make up unique directory entries, so they
have to be 8-character filenames; even then we are fairly close to
running out of unique filenames.

Journaling
----------

A journaling extension to the ext2 code has been developed by Stephen
Tweedie. It avoids the risks of metadata corruption and the need to
wait for e2fsck to complete after a crash, without requiring a change
to the on-disk ext2 layout. In a nutshell, the journal is a regular
file which stores whole metadata (and optionally data) blocks that have
been modified, prior to writing them into the filesystem. This means
it is possible to add a journal to an existing ext2 filesystem without
the need for data conversion.

When changes are made to the filesystem (e.g. a file is renamed), they
are stored in a transaction in the journal and can either be complete or
incomplete at the time of a crash. If a transaction is complete at the
time of a crash
(or in the normal case where the system does not crash), then any blocks
in that transaction are guaranteed to represent a valid filesystem state,
and are copied into the filesystem. If a transaction is incomplete at
the time of the crash, then there is no guarantee of consistency for
the blocks in that transaction so they are discarded (which means any
filesystem changes they represent are also lost).
Check Documentation/filesystems/ext4/ if you want to read more about
ext4 and journaling.
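The replay rule can be modelled in a few lines. This is a toy model of the behaviour described above, with the filesystem reduced to a mapping from block number to contents; it is not how the ext3 journal is actually encoded, and all names are hypothetical.

```python
def replay_journal(filesystem: dict, journal: list) -> dict:
    """Return the filesystem state after recovery.

    Each journal entry is a dict with a 'complete' flag and a 'blocks'
    mapping of block number -> new contents.  Complete transactions are
    copied into the filesystem; incomplete ones are discarded.
    """
    recovered = dict(filesystem)
    for transaction in journal:
        if transaction["complete"]:
            recovered.update(transaction["blocks"])
        # Incomplete transaction: its blocks are dropped, so the
        # changes they represent are lost.
    return recovered


before = {10: "dir block: old-name", 11: "inode table"}
journal = [
    {"complete": True, "blocks": {10: "dir block: new-name"}},
    {"complete": False, "blocks": {11: "half-written inode table"}},
]
print(replay_journal(before, journal))
```

In this model the rename in the first transaction survives the crash, while the half-written second transaction leaves block 11 exactly as it was.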

References
==========

=======================  ===============================================
The kernel source        file:/usr/src/linux/fs/ext2/
e2fsprogs (e2fsck)       http://e2fsprogs.sourceforge.net/
Design & Implementation  http://e2fsprogs.sourceforge.net/ext2intro.html
Journaling (ext3)        ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
Filesystem Resizing      http://ext2resize.sourceforge.net/
Compression [1]_         http://e2compr.sourceforge.net/
=======================  ===============================================

Implementations for:

=======================  ===========================================================
Windows 95/98/NT/2000    http://www.chrysocome.net/explore2fs
Windows 95 [1]_          http://www.yipton.net/content.html#FSDEXT2
DOS client [1]_          ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
OS/2 [2]_                ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
RISC OS client           http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/
=======================  ===========================================================

.. [1] no longer actively developed/supported (as of Apr 2001)
.. [2] no longer actively developed/supported (as of Mar 2009)