Commit | Line | Data |
---|---|---|
fc513a33 DK |
1 | |
2 | Ext4 Filesystem | |
3 | =============== | |
4 | ||
22359f57 DC |
5 | Ext4 is an advanced level of the ext3 filesystem which incorporates |
6 | scalability and reliability enhancements for supporting large filesystems | |
7 | (64 bit) in keeping with increasing disk capacities and state-of-the-art | |
8 | feature requirements. | |
fc513a33 | 9 | |
22359f57 DC |
10 | Mailing list: linux-ext4@vger.kernel.org |
11 | Web site: http://ext4.wiki.kernel.org | |
fc513a33 DK |
12 | |
13 | ||
14 | 1. Quick usage instructions: | |
15 | =========================== | |
16 | ||
22359f57 DC |
17 | Note: More extensive information for getting started with ext4 can be |
18 | found at the ext4 wiki site at the URL: | |
19 | http://ext4.wiki.kernel.org/index.php/Ext4_Howto | |
20 | ||
93e3270c | 21 | - Compile and install the latest version of e2fsprogs (as of this |
22359f57 | 22 | writing version 1.41.3) from: |
93e3270c JS |
23 | |
24 | http://sourceforge.net/project/showfiles.php?group_id=2406 | |
25 | ||
26 | or | |
27 | ||
fc513a33 DK |
28 | ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/ |
29 | ||
93e3270c JS |
30 | or grab the latest git repository from: |
31 | ||
32 | git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git | |
33 | ||
4537398d TT |
34 | - Note that it is highly important to install the mke2fs.conf file |
35 | that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If | |
36 | you have edited the /etc/mke2fs.conf file installed on your system, | |
37 | you will need to merge your changes with the version from e2fsprogs | |
38 | 1.41.x. | |
39 | ||
03010a33 | 40 | - Create a new filesystem using the ext4 filesystem type: |
93e3270c | 41 | |
03010a33 | 42 | # mke2fs -t ext4 /dev/hda1 |
93e3270c | 43 | |
22359f57 | 44 | Or to configure an existing ext3 filesystem to support extents: |
fc513a33 | 45 | |
22359f57 | 46 | # tune2fs -O extents /dev/hda1 |
fc513a33 | 47 | |
93e3270c JS |
48 | If the filesystem was created with 128 byte inodes, it can be |
49 | converted to use 256 byte for greater efficiency via: | |
fc513a33 | 50 | |
93e3270c | 51 | # tune2fs -I 256 /dev/hda1 |
fc513a33 | 52 | |
03010a33 | 53 | (Note: we currently do not have tools to convert an ext4 |
93e3270c JS |
54 | filesystem back to ext3; so please do not try this on production |
55 | filesystems.) | |
fc513a33 | 56 | |
93e3270c JS |
57 | - Mounting: |
58 | ||
03010a33 | 59 | # mount -t ext4 /dev/hda1 /wherever |
fc513a33 DK |
60 | |
61 | - When comparing performance with other filesystems, remember that | |
93e3270c JS |
62 | ext3/4 by default offers higher data integrity guarantees than most. |
63 | So when comparing with a metadata-only journalling filesystem, such | |
64 | as ext3, use `mount -o data=writeback'. And you might as well use | |
65 | `mount -o nobh' too along with it. Making the journal larger than | |
66 | the mke2fs default often helps performance with metadata-intensive | |
67 | workloads. | |
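The quick usage steps above can be collected into a small dry-run sketch. It only prints the commands from this section for a given device and mount point and does not run them, so it is safe to try without root. The helper names `ext4_quickstart` and `ext3_to_ext4` are hypothetical and not part of e2fsprogs; the device and mount point are the placeholders used in the text.

```shell
#!/bin/sh
# Dry run: print (do not execute) the commands from the quick usage steps.
# ext4_quickstart and ext3_to_ext4 are hypothetical helper names.

# Commands to create and mount a new ext4 filesystem.
ext4_quickstart() {
    dev=$1
    mnt=$2
    echo "mke2fs -t ext4 $dev"
    echo "mount -t ext4 $dev $mnt"
}

# Commands to enable extents and 256 byte inodes on an existing ext3
# filesystem. As noted above, there is no tool to convert back to ext3.
ext3_to_ext4() {
    dev=$1
    echo "tune2fs -O extents $dev"
    echo "tune2fs -I 256 $dev"
}

ext4_quickstart /dev/hda1 /wherever
```

Review the printed commands before running any of them by hand as root.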
fc513a33 DK |
68 | |
69 | 2. Features | |
70 | =========== | |
71 | ||
72 | 2.1 Currently available | |
73 | ||
93e3270c | 74 | * ability to use filesystems > 16TB (e2fsprogs support not available yet) |
fc513a33 DK |
75 | * extent format reduces metadata overhead (RAM, IO for access, transactions) |
76 | * extent format more robust in face of on-disk corruption due to magics, | |
77 |   internal redundancy in tree | |
49f1487b | 78 | * improved file allocation (multi-block alloc) |
93e3270c JS |
79 | * fix 32000 subdirectory limit |
80 | * nsec timestamps for mtime, atime, ctime, create time | |
81 | * inode version field on disk (NFSv4, Lustre) | |
82 | * reduced e2fsck time via uninit_bg feature | |
83 | * journal checksumming for robustness, performance | |
84 | * persistent file preallocation (e.g. for streaming media, databases) | |
85 | * ability to pack bitmaps and inode tables into larger virtual groups via the | |
86 | flex_bg feature | |
87 | * large file support | |
88 | * Inode allocation using large virtual block groups via flex_bg | |
49f1487b MC |
89 | * delayed allocation |
90 | * large block (up to pagesize) support | |
91 | * efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force | |
92 | the ordering) | |
fc513a33 DK |
93 | |
94 | 2.2 Candidate features for future inclusion | |
95 | ||
93e3270c JS |
96 | * Online defrag (patches available but not well tested) |
97 | * reduced mke2fs time via lazy itable initialization in conjunction with | |
98 | the uninit_bg feature (capability to do this is available in e2fsprogs | |
99 | but a kernel thread to do lazy zeroing of unused inode table blocks | |
100 | after filesystem is first mounted is required for safety) | |
fc513a33 | 101 | |
93e3270c JS |
102 | There are several others under discussion; whether they all make it in is |
103 | partly a function of how much time everyone has to work on them. Features like | |
104 | metadata checksumming have been discussed and planned for a bit but no patches | |
105 | exist yet so I'm not sure they're in the near-term roadmap. | |
fc513a33 | 106 | |
93e3270c JS |
107 | The big performance win will come with mballoc, delalloc and flex_bg |
108 | grouping of bitmaps and inode tables. Some test results available here: | |
fc513a33 | 109 | |
22359f57 DC |
110 | - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html |
111 | - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html | |
fc513a33 DK |
112 | |
113 | 3. Options | |
114 | ========== | |
115 | ||
116 | When mounting an ext4 filesystem, the following options are accepted: | |
117 | (*) == default | |
118 | ||
c9de560d | 119 | extents (*) ext4 will use extents to address file data. The |
fc513a33 DK |
120 | file system will no longer be mountable by ext3. |
121 | ||
c9de560d AT |
122 | noextents ext4 will not use extents for newly created files |
123 | ||
818d276c GS |
124 | journal_checksum Enable checksumming of the journal transactions. |
125 | This will allow the recovery code in e2fsck and the | |
126 | kernel to detect corruption in the journal. It is a | |
127 | compatible change and will be ignored by older kernels. | |
128 | ||
129 | journal_async_commit Commit block can be written to disk without waiting | |
130 | for descriptor blocks. If enabled, older kernels cannot | |
131 | mount the device. This will enable 'journal_checksum' | |
132 | internally. | |
133 | ||
fc513a33 DK |
134 | journal=update Update the ext4 file system's journal to the current |
135 | format. | |
136 | ||
137 | journal=inum When a journal already exists, this option is ignored. | |
138 | Otherwise, it specifies the number of the inode which | |
139 | will represent the ext4 file system's journal file. | |
140 | ||
141 | journal_dev=devnum When the external journal device's major/minor numbers | |
142 | have changed, this option allows the user to specify | |
143 | the new journal location. The journal device is | |
144 | identified through its new major/minor numbers encoded | |
145 | in devnum. | |
146 | ||
147 | noload Don't load the journal on mounting. | |
148 | ||
149 | data=journal All data are committed into the journal prior to being | |
150 | written into the main file system. | |
151 | ||
152 | data=ordered (*) All data are forced directly out to the main file | |
153 | system prior to its metadata being committed to the | |
154 | journal. | |
155 | ||
156 | data=writeback Data ordering is not preserved, data may be written | |
157 | into the main file system after its metadata has been | |
158 | committed to the journal. | |
159 | ||
160 | commit=nrsec (*) Ext4 can be told to sync all its data and metadata | |
161 | every 'nrsec' seconds. The default value is 5 seconds. | |
162 | This means that if you lose your power, you will lose | |
163 | as much as the latest 5 seconds of work (your | |
164 | filesystem will not be damaged though, thanks to the | |
165 | journaling). This default value (or any low value) | |
166 | will hurt performance, but it's good for data-safety. | |
167 | Setting it to 0 will have the same effect as leaving | |
168 | it at the default (5 seconds). | |
169 | Setting it to very large values will improve | |
170 | performance. | |
171 | ||
571640ca ES |
172 | barrier=<0|1(*)> This enables/disables the use of write barriers in |
173 | the jbd code. barrier=0 disables, barrier=1 enables. | |
174 | This also requires an IO stack which can support | |
175 | barriers, and if jbd gets an error on a barrier | |
176 | write, it will disable again with a warning. | |
177 | Write barriers enforce proper on-disk ordering | |
178 | of journal commits, making volatile disk write caches | |
179 | safe to use, at some performance penalty. If | |
180 | your disks are battery-backed in one way or another, | |
181 | disabling barriers may safely improve performance. | |
fc513a33 | 182 | |
240799cd TT |
183 | inode_readahead=n This tuning parameter controls the maximum |
184 | number of inode table blocks that ext4's inode | |
185 | table readahead algorithm will pre-read into | |
186 | the buffer cache. The default value is 32 blocks. | |
187 | ||
fc513a33 DK |
188 | orlov (*) This enables the new Orlov block allocator. It is |
189 | enabled by default. | |
190 | ||
191 | oldalloc This disables the Orlov block allocator and enables | |
192 | the old block allocator. Orlov should have better | |
193 | performance - we'd like to get some feedback if it's | |
194 | the contrary for you. | |
195 | ||
196 | user_xattr Enables Extended User Attributes. Additionally, you | |
197 | need to have extended attribute support enabled in the | |
198 | kernel configuration (CONFIG_EXT4_FS_XATTR). See the | |
199 | attr(5) manual page and http://acl.bestbits.at/ to | |
200 | learn more about extended attributes. | |
201 | ||
202 | nouser_xattr Disables Extended User Attributes. | |
203 | ||
204 | acl Enables POSIX Access Control Lists support. | |
205 | Additionally, you need to have ACL support enabled in | |
206 | the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL). | |
207 | See the acl(5) manual page and http://acl.bestbits.at/ | |
208 | for more information. | |
209 | ||
210 | noacl This option disables POSIX Access Control List | |
211 | support. | |
212 | ||
213 | reservation | |
214 | ||
215 | noreservation | |
216 | ||
217 | bsddf (*) Make 'df' act like BSD. | |
218 | minixdf Make 'df' act like Minix. | |
219 | ||
fc513a33 DK |
220 | debug Extra debugging information is sent to syslog. |
221 | ||
222 | errors=remount-ro(*) Remount the filesystem read-only on an error. | |
223 | errors=continue Keep going on a filesystem error. | |
224 | errors=panic Panic and halt the machine if an error occurs. | |
225 | ||
5bf5683a HK |
226 | data_err=ignore(*) Just print an error message if an error occurs |
227 | in a file data buffer in ordered mode. | |
228 | data_err=abort Abort the journal if an error occurs in a file | |
229 | data buffer in ordered mode. | |
230 | ||
fc513a33 DK |
231 | grpid Give objects the same group ID as their parent directory. |
232 | bsdgroups | |
233 | ||
234 | nogrpid (*) New objects have the group ID of their creator. | |
235 | sysvgroups | |
236 | ||
237 | resgid=n The group ID which may use the reserved blocks. | |
238 | ||
239 | resuid=n The user ID which may use the reserved blocks. | |
240 | ||
241 | sb=n Use alternate superblock at this location. | |
242 | ||
243 | quota | |
244 | noquota | |
245 | grpquota | |
246 | usrquota | |
247 | ||
248 | bh (*) ext4 associates buffer heads to data pages to | |
249 | nobh (a) cache disk block mapping information | |
250 | (b) link pages into transaction to provide | |
251 | ordering guarantees. | |
252 | "bh" option forces use of buffer heads. | |
253 | "nobh" option tries to avoid associating buffer | |
254 | heads (supported only for "writeback" mode). | |
255 | ||
c9de560d AT |
256 | stripe=n Number of filesystem blocks that mballoc will try |
257 | to use for allocation size and alignment. For RAID5/6 | |
258 | systems this should be the number of data | |
259 | disks * RAID chunk size in file system blocks. | |
49f1487b MC |
260 | delalloc (*) Defer block allocation until write-out time. |
261 | nodelalloc Disable delayed allocation. Blocks are allocated | |
262 | when data is copied from user to page cache. | |
240799cd | 263 | |
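The stripe=n rule above (number of data disks * RAID chunk size in file system blocks) is simple arithmetic, sketched below. The helper name `stripe_blocks`, the example array geometry, and the device /dev/md0 are assumptions for illustration only; check the real chunk size and block size of your array before using the result.

```shell
#!/bin/sh
# stripe_blocks DATA_DISKS CHUNK_KIB BLOCK_BYTES
# Hypothetical helper: prints the value for the stripe=n mount option,
# i.e. data disks * RAID chunk size expressed in filesystem blocks.
stripe_blocks() {
    data_disks=$1
    chunk_kib=$2
    block_bytes=$3
    # RAID chunk size converted from KiB to filesystem blocks.
    chunk_blocks=$(( chunk_kib * 1024 / block_bytes ))
    echo $(( data_disks * chunk_blocks ))
}

# Assumed example: RAID5 with 4 data disks, 64 KiB chunks, 4096 byte blocks.
# The mount command is only printed, not executed.
echo "mount -t ext4 -o stripe=$(stripe_blocks 4 64 4096) /dev/md0 /wherever"
```

For a RAID5/6 array, count only the data disks and leave out the parity disks, as the option description states.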
fc513a33 | 264 | Data Mode |
93e3270c | 265 | ========= |
fc513a33 DK |
266 | There are 3 different data modes: |
267 | ||
268 | * writeback mode | |
269 | In data=writeback mode, ext4 does not journal data at all. This mode provides | |
270 | a similar level of journaling as that of XFS, JFS, and ReiserFS in its default | |
271 | mode - metadata journaling. A crash+recovery can cause incorrect data to | |
272 | appear in files which were written shortly before the crash. This mode will | |
273 | typically provide the best ext4 performance. | |
274 | ||
275 | * ordered mode | |
276 | In data=ordered mode, ext4 only officially journals metadata, but it logically | |
49f1487b MC |
277 | groups metadata information related to data changes with the data blocks into a |
278 | single unit called a transaction. When it's time to write the new metadata | |
279 | out to disk, the associated data blocks are written first. In general, | |
280 | this mode performs slightly slower than writeback but significantly faster than journal mode. | |
fc513a33 DK |
281 | |
282 | * journal mode | |
283 | data=journal mode provides full data and metadata journaling. All new data is | |
284 | written to the journal first, and then to its final location. | |
285 | In the event of a crash, the journal can be replayed, bringing both data and | |
286 | metadata into a consistent state. This mode is the slowest except when data | |
287 | needs to be read from and written to disk at the same time where it | |
49f1487b MC |
288 | outperforms all other modes. Currently ext4 does not have delayed |
289 | allocation support if this data journalling mode is selected. | |
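As a minimal sketch of how one of the three data modes is selected at mount time, the helper below maps a mode name to the matching data= option and rejects anything else. The name `data_mode_opt` is hypothetical, and the mount command is only printed, not executed.

```shell
#!/bin/sh
# data_mode_opt MODE
# Hypothetical helper: prints the data= mount option for one of the three
# data modes described above, or fails for an unknown mode name.
data_mode_opt() {
    case $1 in
        writeback|ordered|journal)
            echo "data=$1"
            ;;
        *)
            echo "unknown data mode: $1" >&2
            return 1
            ;;
    esac
}

# Print the mount command for full data journaling.
echo "mount -t ext4 -o $(data_mode_opt journal) /dev/hda1 /wherever"
```

Since data=ordered is the default, the option only needs to be passed explicitly for writeback or journal mode.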
fc513a33 DK |
290 | |
291 | References | |
292 | ========== | |
293 | ||
294 | kernel source: <file:fs/ext4/> | |
295 | <file:fs/jbd2/> | |
296 | ||
297 | programs: http://e2fsprogs.sourceforge.net/ | |
fc513a33 DK |
298 | |
299 | useful links: http://fedoraproject.org/wiki/ext3-devel | |
300 | http://www.bullopensource.org/ext4/ | |
93e3270c JS |
301 | http://ext4.wiki.kernel.org/index.php/Main_Page |
302 | http://fedoraproject.org/wiki/Features/Ext4 |