Commit | Line | Data |
---|---|---|
fc513a33 DK |
1 | |
2 | Ext4 Filesystem | |
3 | =============== | |
4 | ||
5 | This is a development version of the ext4 filesystem, a successor to | |
6 | the ext3 filesystem that incorporates scalability and reliability | |
7 | enhancements for supporting large (64-bit) filesystems, in keeping with | |
8 | increasing disk capacities and state-of-the-art feature requirements. | |
9 | ||
10 | Mailing list: linux-ext4@vger.kernel.org | |
11 | ||
12 | ||
13 | 1. Quick usage instructions: | |
14 | =========================== | |
15 | ||
16 | - Grab updated e2fsprogs from | |
17 | ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/ | |
18 | This is a patchset on top of e2fsprogs-1.39, which can be found at | |
19 | ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/ | |
20 | ||
21 | - It's still mke2fs -j /dev/hda1 | |
22 | ||
23 | - mount /dev/hda1 /wherever -t ext4dev | |
24 | ||
25 | - To enable extents, | |
26 | ||
27 | mount /dev/hda1 /wherever -t ext4dev -o extents | |
28 | ||
29 | - The filesystem is compatible with the ext3 driver until you add a file | |
30 | which has extents (i.e. `mount -o extents', then create a file). | |
31 | ||
32 | NOTE: The "extents" mount flag is temporary. It will soon go away and | |
33 | extents will be enabled by the "-o extents" flag to mke2fs or tune2fs. | |
34 | ||
35 | - When comparing performance with other filesystems, remember that | |
36 | ext3/4 by default offers higher data integrity guarantees than most. So | |
37 | when comparing with a metadata-only journalling filesystem, use `mount -o | |
38 | data=writeback'. And you might as well use `mount -o nobh' too along | |
39 | with it. Making the journal larger than the mke2fs default often helps | |
40 | performance with metadata-intensive workloads. | |
41 | ||
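The benchmarking advice above can be sketched as a dry run that only prints the commands, since running them needs root and would reformat the device. The device and mount point are the placeholders used in the examples above; the 128 MB journal size is an assumption for illustration, not a recommended value. `-J size=` is the mke2fs journal-size option, given in megabytes.

```shell
#!/bin/sh
# Dry run: print the commands instead of executing them.
DEV=/dev/hda1        # placeholder device, as in the examples above
MNT=/wherever        # placeholder mount point
JOURNAL_MB=128       # assumed journal size in MB; tune for your workload

bench_cmds() {
    # -j creates a journal; -J size= sets its size in megabytes.
    echo "mke2fs -j -J size=$JOURNAL_MB $DEV"
    # Metadata-only journalling for a like-for-like comparison, plus nobh.
    echo "mount $DEV $MNT -t ext4dev -o data=writeback,nobh"
}

bench_cmds
```

Review the printed commands before running them by hand on a scratch device.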
42 | 2. Features | |
43 | =========== | |
44 | ||
45 | 2.1 Currently available | |
46 | ||
47 | * ability to use filesystems > 16TB | |
48 | * extent format reduces metadata overhead (RAM, IO for access, transactions) | |
49 | * extent format more robust in the face of on-disk corruption due to magic numbers, | |
50 | internal redundancy in tree | |
51 | ||
52 | 2.2 Previously available, soon to be enabled by default by "mkfs.ext4": | |
53 | ||
54 | * dir_index and resize inode will be on by default | |
55 | * large inodes will be used by default for fast EAs, nsec timestamps, etc | |
56 | ||
57 | 2.3 Candidate features for future inclusion | |
58 | ||
59 | There are several under discussion; whether they all make it in is | |
60 | partly a function of how much time everyone has to work on them: | |
61 | ||
62 | * improved file allocation (multi-block alloc, delayed alloc; basically done) | |
63 | * fix 32000 subdirectory limit (patch exists, needs some e2fsck work) | |
64 | * nsec timestamps for mtime, atime, ctime, create time (patch exists, | |
65 | needs some e2fsck work) | |
66 | * inode version field on disk (NFSv4, Lustre; prototype exists) | |
67 | * reduced mke2fs/e2fsck time via uninitialized groups (prototype exists) | |
68 | * journal checksumming for robustness, performance (prototype exists) | |
69 | * persistent file preallocation (e.g. for streaming media, databases) | |
70 | ||
71 | Features like metadata checksumming have been discussed and planned for | |
72 | a bit but no patches exist yet so I'm not sure they're in the near-term | |
73 | roadmap. | |
74 | ||
75 | The big performance win will come with mballoc and delalloc. CFS has | |
76 | been using mballoc for a few years already with Lustre, and IBM + Bull | |
77 | did a lot of benchmarking on it. The reason it isn't in the first set of | |
78 | patches is partly a manageability issue, and partly because it doesn't | |
79 | directly affect the on-disk format (outside of much better allocation) | |
80 | so it isn't critical to get into the first round of changes. I believe | |
81 | Alex is working on a new set of patches right now. | |
82 | ||
83 | 3. Options | |
84 | ========== | |
85 | ||
86 | When mounting an ext4 filesystem, the following options are accepted: | |
87 | (*) == default | |
88 | ||
c9de560d | 89 | extents (*) ext4 will use extents to address file data. The |
fc513a33 DK |
90 | file system will no longer be mountable by ext3. |
91 | ||
c9de560d AT |
92 | noextents ext4 will not use extents for newly created files. |
93 | ||
818d276c GS |
94 | journal_checksum Enable checksumming of the journal transactions. |
95 | This will allow the recovery code in e2fsck and the | |
96 | kernel to detect corruption in the journal. It is a | |
97 | compatible change and will be ignored by older kernels. | |
98 | ||
99 | journal_async_commit Commit block can be written to disk without waiting | |
100 | for descriptor blocks. If enabled, older kernels cannot | |
101 | mount the device. This will enable 'journal_checksum' | |
102 | internally. | |
103 | ||
fc513a33 DK |
104 | journal=update Update the ext4 file system's journal to the current |
105 | format. | |
106 | ||
107 | journal=inum When a journal already exists, this option is ignored. | |
108 | Otherwise, it specifies the number of the inode which | |
109 | will represent the ext4 file system's journal file. | |
110 | ||
111 | journal_dev=devnum When the external journal device's major/minor numbers | |
112 | have changed, this option allows the user to specify | |
113 | the new journal location. The journal device is | |
114 | identified through its new major/minor numbers encoded | |
115 | in devnum. | |
116 | ||
117 | noload Don't load the journal on mounting. | |
118 | ||
119 | data=journal All data are committed into the journal prior to being | |
120 | written into the main file system. | |
121 | ||
122 | data=ordered (*) All data are forced directly out to the main file | |
123 | system prior to its metadata being committed to the | |
124 | journal. | |
125 | ||
126 | data=writeback Data ordering is not preserved, data may be written | |
127 | into the main file system after its metadata has been | |
128 | committed to the journal. | |
129 | ||
130 | commit=nrsec (*) Ext4 can be told to sync all its data and metadata | |
131 | every 'nrsec' seconds. The default value is 5 seconds. | |
132 | This means that if you lose your power, you will lose | |
133 | as much as the latest 5 seconds of work (your | |
134 | filesystem will not be damaged though, thanks to the | |
135 | journaling). This default value (or any low value) | |
136 | will hurt performance, but it's good for data-safety. | |
137 | Setting it to 0 will have the same effect as leaving | |
138 | it at the default (5 seconds). | |
139 | Setting it to very large values will improve | |
140 | performance. | |
141 | ||
571640ca ES |
142 | barrier=<0|1(*)> This enables/disables the use of write barriers in |
143 | the jbd code. barrier=0 disables, barrier=1 enables. | |
144 | This also requires an IO stack which can support | |
145 | barriers, and if jbd gets an error on a barrier | |
146 | write, it will disable barriers again with a warning. | |
147 | Write barriers enforce proper on-disk ordering | |
148 | of journal commits, making volatile disk write caches | |
149 | safe to use, at some performance penalty. If | |
150 | your disks are battery-backed in one way or another, | |
151 | disabling barriers may safely improve performance. | |
fc513a33 DK |
152 | |
153 | orlov (*) This enables the new Orlov block allocator. It is | |
154 | enabled by default. | |
155 | ||
156 | oldalloc This disables the Orlov block allocator and enables | |
157 | the old block allocator. Orlov should have better | |
158 | performance - we'd like to get some feedback if it's | |
159 | the contrary for you. | |
160 | ||
161 | user_xattr Enables Extended User Attributes. Additionally, you | |
162 | need to have extended attribute support enabled in the | |
163 | kernel configuration (CONFIG_EXT4_FS_XATTR). See the | |
164 | attr(5) manual page and http://acl.bestbits.at/ to | |
165 | learn more about extended attributes. | |
166 | ||
167 | nouser_xattr Disables Extended User Attributes. | |
168 | ||
169 | acl Enables POSIX Access Control Lists support. | |
170 | Additionally, you need to have ACL support enabled in | |
171 | the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL). | |
172 | See the acl(5) manual page and http://acl.bestbits.at/ | |
173 | for more information. | |
174 | ||
175 | noacl This option disables POSIX Access Control List | |
176 | support. | |
177 | ||
178 | reservation | |
179 | ||
180 | noreservation | |
181 | ||
182 | bsddf (*) Make 'df' act like BSD. | |
183 | minixdf Make 'df' act like Minix. | |
184 | ||
185 | check=none Don't do extra checking of bitmaps on mount. | |
186 | nocheck | |
187 | ||
188 | debug Extra debugging information is sent to syslog. | |
189 | ||
190 | errors=remount-ro(*) Remount the filesystem read-only on an error. | |
191 | errors=continue Keep going on a filesystem error. | |
192 | errors=panic Panic and halt the machine if an error occurs. | |
193 | ||
194 | grpid Give objects the same group ID as their parent directory. | |
195 | bsdgroups | |
196 | ||
197 | nogrpid (*) New objects have the group ID of their creator. | |
198 | sysvgroups | |
199 | ||
200 | resgid=n The group ID which may use the reserved blocks. | |
201 | ||
202 | resuid=n The user ID which may use the reserved blocks. | |
203 | ||
204 | sb=n Use alternate superblock at this location. | |
205 | ||
206 | quota | |
207 | noquota | |
208 | grpquota | |
209 | usrquota | |
210 | ||
211 | bh (*) ext4 associates buffer heads to data pages to | |
212 | nobh (a) cache disk block mapping information | |
213 | (b) link pages into transaction to provide | |
214 | ordering guarantees. | |
215 | "bh" option forces use of buffer heads. | |
216 | "nobh" option tries to avoid associating buffer | |
217 | heads (supported only for "writeback" mode). | |
218 | ||
c9de560d AT |
219 | mballoc (*) Use the multiple block allocator for block allocation |
220 | nomballoc Disable the multiple block allocator for block allocation. | |
221 | stripe=n Number of filesystem blocks that mballoc will try | |
222 | to use for allocation size and alignment. For RAID5/6 | |
223 | systems this should be the number of data | |
224 | disks * RAID chunk size in file system blocks. | |
fc513a33 DK |
225 | |
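The stripe=n rule above (data disks * RAID chunk size in file system blocks) can be worked through as a short calculation. This is a sketch only: the array geometry, the /dev/md0 device name, and the helper name `stripe_blocks` are assumptions for illustration, and the mount command is printed rather than run.

```shell
#!/bin/sh
# stripe_blocks DATA_DISKS CHUNK_KIB BLOCK_KIB
# Prints data disks * (chunk size expressed in file system blocks).
stripe_blocks() {
    data_disks=$1
    chunk_kib=$2
    block_kib=$3
    echo $(( data_disks * chunk_kib / block_kib ))
}

# Assumed geometry: 5-disk RAID5 (4 data disks + 1 parity),
# 64 KiB RAID chunks, 4 KiB file system blocks.
STRIPE=$(stripe_blocks 4 64 4)

# Print the resulting mount command; /dev/md0 is a hypothetical device.
echo "mount /dev/md0 /wherever -t ext4dev -o stripe=$STRIPE"
```

With this geometry each chunk is 16 file system blocks, so the four data disks give stripe=64.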
226 | Data Mode | |
227 | --------- | |
228 | There are 3 different data modes: | |
229 | ||
230 | * writeback mode | |
231 | In data=writeback mode, ext4 does not journal data at all. This mode provides | |
232 | a similar level of journaling to that of XFS, JFS, and ReiserFS in their default | |
233 | modes: metadata journaling. A crash+recovery can cause incorrect data to | |
234 | appear in files which were written shortly before the crash. This mode will | |
235 | typically provide the best ext4 performance. | |
236 | ||
237 | * ordered mode | |
238 | In data=ordered mode, ext4 only officially journals metadata, but it logically | |
239 | groups metadata and data blocks into a single unit called a transaction. When | |
240 | it's time to write the new metadata out to disk, the associated data blocks | |
241 | are written first. In general, this mode performs slightly slower than | |
242 | writeback but significantly faster than journal mode. | |
243 | ||
244 | * journal mode | |
245 | data=journal mode provides full data and metadata journaling. All new data is | |
246 | written to the journal first, and then to its final location. | |
247 | In the event of a crash, the journal can be replayed, bringing both data and | |
248 | metadata into a consistent state. This mode is the slowest, except when data | |
249 | needs to be read from and written to disk at the same time, where it | |
250 | outperforms all other modes. | |
251 | ||
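To make the default concrete: when no data= option is given, ordered mode applies. The helper below is a minimal sketch, with `data_mode` being a hypothetical name, that reports which of the three modes a comma-separated option string (such as the one passed to `mount -o`) selects. It only inspects the string and does not query a mounted filesystem.

```shell
#!/bin/sh
# data_mode OPTS
# Print the data= mode named in a comma-separated mount option string,
# or "ordered" (the default) if no data= option is present.
data_mode() {
    mode=ordered
    old_ifs=$IFS
    IFS=,
    for opt in $1; do
        case $opt in
            data=*) mode=${opt#data=} ;;
        esac
    done
    IFS=$old_ifs
    echo "$mode"
}

data_mode "rw,data=writeback,nobh"   # prints: writeback
data_mode "rw,noatime"               # prints: ordered
```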
252 | References | |
253 | ========== | |
254 | ||
255 | kernel source: <file:fs/ext4/> | |
256 | <file:fs/jbd2/> | |
257 | ||
258 | programs: http://e2fsprogs.sourceforge.net/ | |
259 | http://ext2resize.sourceforge.net | |
260 | ||
261 | useful links: http://fedoraproject.org/wiki/ext3-devel | |
262 | http://www.bullopensource.org/ext4/ |