Commit | Line | Data |
---|---|---|
e66d8631 MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ====================================== | |
4 | Enhanced Read-Only File System - EROFS | |
5 | ====================================== | |
6 | ||
fdb05364 GX |
7 | Overview |
8 | ======== | |
9 | ||
10 | EROFS stands for Enhanced Read-Only File System. Unlike other read-only | |
11 | file systems, it is designed for flexibility and scalability while | |
12 | remaining simple and high-performance. | |
13 | ||
14 | It is designed as a better filesystem solution for the following scenarios: | |
e66d8631 | 15 | |
fdb05364 GX |
16 | - read-only storage media; |
17 | ||
18 | - part of a fully trusted read-only solution, which must be immutable | |
19 | and bit-for-bit identical to the official golden image of a release | |
20 | for security and other reasons; | |
21 | ||
22 | - systems that need to save storage space with guaranteed end-to-end | |
23 | performance by using reduced metadata and transparent file compression, | |
24 | especially embedded devices with limited memory (e.g. smartphones). | |
25 | ||
26 | Here are the main features of EROFS: | |
e66d8631 | 27 | |
fdb05364 GX |
28 | - Little endian on-disk design; |
29 | ||
30 | - Currently 4KB block size (nobh) and therefore maximum 16TB address space; | |
31 | ||
32 | - Metadata & data could be mixed by design; | |
33 | ||
34 | - 2 inode versions for different requirements: | |
e66d8631 MCC |
35 | |
36 | ===================== ============ ===================================== | |
ffafde47 | 37 | compact (v1) extended (v2) |
e66d8631 MCC |
38 | ===================== ============ ===================================== |
39 | Inode metadata size 32 bytes 64 bytes | |
40 | Max file size 4 GB 16 EB (also limited by max. vol size) | |
41 | Max uids/gids 65536 4294967296 | |
42 | File change time no yes (64 + 32-bit timestamp) | |
43 | Max hardlinks 65536 4294967296 | |
44 | Metadata reserved 4 bytes 14 bytes | |
45 | ===================== ============ ===================================== | |
fdb05364 GX |
46 | |
47 | - Support extended attributes (xattrs) as an option; | |
48 | ||
49 | - Support xattr inline and tail-end data inline for all files; | |
50 | ||
516c115c GX |
51 | - Support POSIX.1e ACLs by using xattrs; |
52 | ||
46f2e044 GX |
53 | - Support transparent data compression as an option: |
54 | LZ4 algorithm with the fixed-sized output compression for high performance. | |
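
The size limits in the feature list follow from on-disk field widths. The arithmetic can be checked with a minimal sketch, assuming 32-bit block addresses and the field widths implied by the inode table (32-bit size and 16-bit uid/gid for compact inodes, 32-bit uid/gid for extended ones). All names below are illustrative, not kernel identifiers.

```python
# Illustrative constants; the field widths are assumptions inferred from
# the limits quoted in the feature list and the inode table.
BLOCK_SIZE = 4096        # 4KB blocks
BLKADDR_BITS = 32        # assumption: block addresses are 32-bit

max_volume_bytes = (1 << BLKADDR_BITS) * BLOCK_SIZE
compact_max_file_bytes = 1 << 32   # 32-bit size field
compact_max_ids = 1 << 16          # 16-bit uid/gid
extended_max_ids = 1 << 32         # 32-bit uid/gid

print(max_volume_bytes // (1 << 40), "TiB address space")
print(compact_max_file_bytes // (1 << 30), "GiB max compact file size")
```
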
fdb05364 GX |
55 | |
56 | The following git tree provides the file system user-space tools under | |
57 | development (e.g. the formatting tool mkfs.erofs): | |
e66d8631 MCC |
58 | |
59 | - git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git | |
fdb05364 GX |
60 | |
61 | Bug reports and patches are welcome; please send them to the following | |
62 | linux-erofs mailing list: | |
e66d8631 MCC |
63 | |
64 | - linux-erofs mailing list <linux-erofs@lists.ozlabs.org> | |
fdb05364 | 65 | |
fdb05364 GX |
66 | Mount options |
67 | ============= | |
68 | ||
e66d8631 | 69 | =================== ========================================================= |
fdb05364 GX |
70 | (no)user_xattr Setup Extended User Attributes. Note: xattr is enabled |
71 | by default if CONFIG_EROFS_FS_XATTR is selected. | |
72 | (no)acl Setup POSIX Access Control List. Note: acl is enabled | |
73 | by default if CONFIG_EROFS_FS_POSIX_ACL is selected. | |
4279f3f9 | 74 | cache_strategy=%s Select a strategy for cached decompression from now on: |
e66d8631 MCC |
75 | |
76 | ========== ============================================= | |
77 | disabled In-place I/O decompression only; | |
78 | readahead Cache the last incomplete compressed physical | |
4279f3f9 GX |
79 | cluster for further reading. It still does |
80 | in-place I/O decompression for the remaining | |
81 | compressed physical clusters; | |
e66d8631 | 82 | readaround Cache both ends of incomplete compressed |
4279f3f9 GX |
83 | physical clusters for further reading. |
84 | It still does in-place I/O decompression | |
85 | for the remaining compressed physical clusters. | |
e66d8631 | 86 | ========== ============================================= |
06252e9c GX |
87 | dax={always,never} Use direct access (no page cache). See |
88 | Documentation/filesystems/dax.rst. | |
89 | dax A legacy option which is an alias for ``dax=always``. | |
e66d8631 | 90 | =================== ========================================================= |
fdb05364 GX |
91 | |
92 | On-disk details | |
93 | =============== | |
94 | ||
95 | Summary | |
96 | ------- | |
97 | Different from other read-only file systems, an EROFS volume is designed | |
e66d8631 | 98 | to be as simple as possible:: |
fdb05364 GX |
99 | |
100 | |-> aligned with the block size | |
101 | ____________________________________________________________ | |
102 | | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data | | |
103 | |_|__|_|_____|__________|_____|______|__________|_____|______| | |
104 | 0 +1K | |
105 | ||
106 | All data areas should be aligned with the block size, but metadata areas | |
107 | may not. All metadata can now be observed in two different spaces (views): | |
e66d8631 | 108 | |
fdb05364 | 109 | 1. Inode metadata space |
e66d8631 | 110 | |
fdb05364 | 111 | Each valid inode should be aligned with an inode slot, which is a fixed |
ffafde47 | 112 | value (32 bytes) and designed to be kept in line with compact inode size. |
fdb05364 GX |
113 | |
114 | Each inode can be directly found with the following formula: | |
115 | inode offset = meta_blkaddr * block_size + 32 * nid | |
116 | ||
e66d8631 MCC |
117 | :: |
118 | ||
1b55767d GX |
119 | |-> aligned with 8B |
120 | |-> followed closely | |
121 | + meta_blkaddr blocks |-> another slot | |
122 | _____________________________________________________________________ | |
123 | | ... | inode | xattrs | extents | data inline | ... | inode ... | |
124 | |________|_______|(optional)|(optional)|__(optional)_|_____|__________ | |
125 | |-> aligned with the inode slot size | |
126 | . . | |
127 | . . | |
128 | . . | |
129 | . . | |
130 | . . | |
131 | . . | |
132 | .____________________________________________________|-> aligned with 4B | |
133 | | xattr_ibody_header | shared xattrs | inline xattrs | | |
134 | |____________________|_______________|_______________| | |
135 | |-> 12 bytes <-|->x * 4 bytes<-| . | |
136 | . . . | |
137 | . . . | |
138 | . . . | |
139 | ._______________________________.______________________. | |
140 | | id | id | id | id | ... | id | ent | ... | ent| ... | | |
141 | |____|____|____|____|______|____|_____|_____|____|_____| | |
142 | |-> aligned with 4B | |
143 | |-> aligned with 4B | |
fdb05364 GX |
144 | |
145 | An inode is either 32 or 64 bytes; the two sizes are distinguished by a common | |
e66d8631 | 146 | field which all inode versions have -- i_format::
fdb05364 GX |
147 | |
148 | __________________ __________________ | |
ffafde47 | 149 | | i_format | | i_format | |
fdb05364 GX |
150 | |__________________| |__________________| |
151 | | ... | | ... | | |
152 | | | | | | |
153 | |__________________| 32 bytes | | | |
154 | | | | |
155 | |__________________| 64 bytes | |
156 | ||
157 | Xattrs, extents and inline data follow the corresponding inode with | |
ffafde47 | 158 | proper alignment, and they can be optional for different data mappings. |
2a9dc7a8 | 159 | _Currently_, a total of 5 data layouts are supported: |
fdb05364 | 160 | |
e66d8631 | 161 | == ==================================================================== |
ffafde47 GX |
162 | 0 flat file data without data inline (no extent); |
163 | 1 fixed-sized output data compression (with non-compacted indexes); | |
164 | 2 flat file data with tail packing data inline (no extent); | |
2a9dc7a8 GX |
165 | 3 fixed-sized output data compression (with compacted indexes, v5.3+); |
166 | 4 chunk-based file (v5.15+). | |
e66d8631 | 167 | == ==================================================================== |
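
As a rough illustration of how i_format can select both the inode size and one of the data layouts listed above, here is a minimal sketch. The bit positions (bit 0 for the inode version, bits 1-3 for the data layout) are an assumption based on erofs_fs.h and should be checked against the header; the function name and the layout descriptions are illustrative only.

```python
# Assumed bit layout of the 16-bit i_format field (verify against erofs_fs.h).
EROFS_I_VERSION_BIT = 0       # bit 0: 0 = compact (32 bytes), 1 = extended (64 bytes)
EROFS_I_DATALAYOUT_BIT = 1    # bits 1-3: data layout number from the table
EROFS_I_DATALAYOUT_BITS = 3

LAYOUT_NAMES = {
    0: "flat, no inline data",
    1: "compressed, non-compacted indexes",
    2: "flat, tail-packing inline data",
    3: "compressed, compacted indexes",
    4: "chunk-based",
}

def decode_i_format(i_format):
    """Return (inode size in bytes, data layout number)."""
    extended = (i_format >> EROFS_I_VERSION_BIT) & 1
    layout_mask = (1 << EROFS_I_DATALAYOUT_BITS) - 1
    layout = (i_format >> EROFS_I_DATALAYOUT_BIT) & layout_mask
    return (64 if extended else 32), layout

size, layout = decode_i_format(0b0101)   # extended inode, layout 2
print(size, LAYOUT_NAMES[layout])
```
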
fdb05364 GX |
168 | |
169 | The size of the optional xattrs is indicated by i_xattr_count in the inode | |
170 | header. Large xattrs or xattrs shared by many different files can be | |
171 | stored in shared xattrs metadata rather than inlined right after the inode. | |
172 | ||
173 | 2. Shared xattrs metadata space | |
e66d8631 | 174 | |
fdb05364 GX |
175 | The shared xattrs space is similar to the inode space above: it starts at |
176 | a specific block indicated by xattr_blkaddr, and entries are laid out one | |
177 | after another with proper alignment. | |
178 | ||
179 | Each shared xattr can also be directly found by the following formula: | |
180 | xattr offset = xattr_blkaddr * block_size + 4 * xattr_id | |
181 | ||
1b55767d | 182 | :: |
e66d8631 | 183 | |
1b55767d GX |
184 | |-> aligned by 4 bytes |
185 | + xattr_blkaddr blocks |-> aligned with 4 bytes | |
186 | _________________________________________________________________________ | |
187 | | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... | |
188 | |________|_____________|_____________|_____|______________|_______________ | |
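
The two lookup formulas for the inode and shared xattr spaces can be written down directly. The sketch below assumes the current 4KB block size; the function and constant names are illustrative and not taken from the kernel sources.

```python
BLOCK_SIZE = 4096        # current block size
INODE_SLOT_SIZE = 32     # inode slot size (compact inode size)
XATTR_UNIT = 4           # shared xattr entries are addressed in 4-byte units

def inode_offset(meta_blkaddr, nid):
    """Byte offset of the inode with the given nid."""
    return meta_blkaddr * BLOCK_SIZE + INODE_SLOT_SIZE * nid

def shared_xattr_offset(xattr_blkaddr, xattr_id):
    """Byte offset of the shared xattr entry with the given id."""
    return xattr_blkaddr * BLOCK_SIZE + XATTR_UNIT * xattr_id

print(inode_offset(1, 36))          # block 1, slot 36
print(shared_xattr_offset(2, 5))    # block 2, 4-byte unit 5
```

Because a nid counts 32-byte slots rather than inodes, a 64-byte extended inode placed at nid n occupies two slots, so the next inode cannot start before nid n + 2.
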
fdb05364 GX |
189 | |
190 | Directories | |
191 | ----------- | |
192 | All directories are now organized in a compact on-disk format. Note that | |
193 | each directory block is divided into index and name areas in order to support | |
194 | random file lookup, and all directory entries are _strictly_ recorded in | |
195 | alphabetical order to support an improved prefix binary search | |
196 | algorithm (see the related source code). | |
197 | ||
e66d8631 MCC |
198 | :: |
199 | ||
1b55767d GX |
200 | ___________________________ |
201 | / | | |
202 | / ______________|________________ | |
203 | / / | nameoff1 | nameoffN-1 | |
204 | ____________.______________._______________v________________v__________ | |
205 | | dirent | dirent | ... | dirent | filename | filename | ... | filename | | |
206 | |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| | |
207 | \ ^ | |
208 | \ | * could have | |
209 | \ | trailing '\0' | |
210 | \________________________| nameoff0 | |
211 | Directory block | |
fdb05364 GX |
212 | |
213 | Note that apart from the offset of the first filename, nameoff0 also indicates | |
214 | the total number of directory entries in this block (nameoff0 divided by the | |
215 | dirent size), so there is no need to introduce another on-disk field. | |
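
A minimal sketch of the directory block layout described above. It assumes a 12-byte little-endian dirent (8-byte nid, 2-byte nameoff, 1-byte file type, 1 reserved byte), which should be checked against struct erofs_dirent in erofs_fs.h. The helper names are hypothetical, and the lookup is a plain binary search rather than the improved prefix binary search used by the kernel.

```python
import struct

# Assumed dirent layout: nid, nameoff, file_type, reserved (12 bytes).
DIRENT = struct.Struct("<QHBB")

def build_dir_block(entries, block_size=4096):
    """Pack (name, nid, file_type) tuples into one directory block."""
    entries = sorted(entries)                  # names must be strictly ordered
    names_start = len(entries) * DIRENT.size   # this becomes nameoff0
    index, names = b"", b""
    for name, nid, ftype in entries:
        index += DIRENT.pack(nid, names_start + len(names), ftype, 0)
        names += name
    return (index + names).ljust(block_size, b"\0")

def parse_dir_block(block):
    """Return the list of (name, nid, file_type) stored in a block."""
    nameoff0 = DIRENT.unpack_from(block, 0)[1]
    count = nameoff0 // DIRENT.size            # entry count comes from nameoff0
    dirents = [DIRENT.unpack_from(block, i * DIRENT.size) for i in range(count)]
    result = []
    for i, (nid, nameoff, ftype, _) in enumerate(dirents):
        end = dirents[i + 1][1] if i + 1 < count else len(block)
        name = block[nameoff:end].split(b"\0", 1)[0]   # last name may be '\0'-padded
        result.append((name, nid, ftype))
    return result

def lookup(block, target):
    """Binary search for a name; returns its nid or None."""
    entries = parse_dir_block(block)
    lo, hi = 0, len(entries) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        name, nid, _ = entries[mid]
        if name == target:
            return nid
        if name < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

block = build_dir_block([(b"usr", 40, 2), (b"bin", 36, 2), (b"etc", 38, 2)])
print(parse_dir_block(block), lookup(block, b"etc"))
```
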
216 | ||
2a9dc7a8 GX |
217 | Chunk-based file |
218 | ---------------- | |
219 | In order to support chunk-based data deduplication, a new inode data layout has | |
220 | been supported since Linux v5.15: files are split into equal-sized data chunks, | |
221 | with the ``extents`` area of the inode metadata indicating how to get the chunk | |
222 | data. This can be either a simple 4-byte block address array or the 8-byte | |
223 | chunk index form (see struct erofs_inode_chunk_index in erofs_fs.h for more | |
224 | details). | |
225 | ||
226 | Note that chunk-based files are all uncompressed for now. | |
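
To make the 4-byte block address array form concrete, here is a minimal sketch of mapping a file offset to a block. It assumes each array entry holds the start block address of one chunk and that a chunk is a run of contiguous blocks. The function name and the chunk_blocks parameter are hypothetical, and holes are not modeled.

```python
BLOCK_SIZE = 4096

def chunk_block_lookup(blkmap, chunk_blocks, file_offset):
    """Return (block address, offset inside that block) for a file offset.

    blkmap       -- start block address of each chunk (4-byte entries on disk)
    chunk_blocks -- chunk size in blocks (hypothetical parameter)
    """
    chunk_size = chunk_blocks * BLOCK_SIZE
    chunk_idx, off_in_chunk = divmod(file_offset, chunk_size)
    blk_in_chunk, off_in_blk = divmod(off_in_chunk, BLOCK_SIZE)
    return blkmap[chunk_idx] + blk_in_chunk, off_in_blk

# Two files sharing one chunk (starting at block 104): that is the deduplication.
blkmap_a = [100, 104]
blkmap_b = [104, 200]
print(chunk_block_lookup(blkmap_a, 4, 20000))
print(chunk_block_lookup(blkmap_b, 4, 3616))
```

Both lookups resolve to the same block because the second chunk of the first file and the first chunk of the second file point at the same data.
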
227 | ||
46f2e044 GX |
228 | Data compression |
229 | ---------------- | |
230 | EROFS implements LZ4 fixed-sized output compression which generates fixed-sized | |
231 | compressed data blocks from variable-sized input in contrast to other existing | |
232 | fixed-sized input solutions. Fixed-sized output compression can achieve | |
233 | relatively higher compression ratios, since today's popular data compression | |
234 | algorithms are mostly LZ77-based and the fixed-sized output approach can | |
235 | benefit from the historical dictionary (a.k.a. sliding window). | |
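
The idea of fixed-sized output can be illustrated with a small sketch: keep growing the input until its compressed form no longer fits into one 4KB output block. This is only an illustration under stated assumptions. It uses zlib from the Python standard library in place of LZ4, a naive greedy search, and a hypothetical function name; it is not how mkfs.erofs is implemented.

```python
import zlib

PCLUSTER_SIZE = 4096   # fixed compressed output size

def pack_fixed_output(data, step=256):
    """Split data into variable-sized extents whose compressed form fits 4KB.

    Returns a list of (logical offset, logical length, compressed bytes).
    """
    extents = []
    pos = 0
    while pos < len(data):
        # A single step of input always fits, so start from there.
        n = min(step, len(data) - pos)
        best = zlib.compress(data[pos:pos + n])
        # Greedily extend the input while the output still fits.
        while pos + n < len(data):
            m = min(n + step, len(data) - pos)
            cand = zlib.compress(data[pos:pos + m])
            if len(cand) > PCLUSTER_SIZE:
                break
            n, best = m, cand
        extents.append((pos, n, best))
        pos += n
    return extents

data = b"".join(b"line %d of some log file\n" % i for i in range(5000))
extents = pack_fixed_output(data)
for off, length, comp in extents[:3]:
    print(off, length, len(comp))
```

Each extent covers a different amount of input, while every compressed output stays within one block. A fixed-sized input scheme does the opposite: it compresses equal-sized input blocks into outputs of varying size.
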
236 | ||
237 | In detail, the original (uncompressed) data is divided into several variable-sized | |
238 | extents and, at the same time, compressed into physical clusters (pclusters). | |
239 | In order to record each variable-sized extent, logical clusters (lclusters) are | |
240 | introduced as the basic unit of compression indexes to indicate whether a new | |
241 | extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are now | |
242 | fixed to the block size, as illustrated below:: | |
e66d8631 | 243 | |
1b55767d GX |
244 | |<- variable-sized extent ->|<- VLE ->| |
245 | clusterofs clusterofs clusterofs | |
246 | | | | | |
247 | _________v_________________________________v_______________________v________ | |
248 | ... | . | | . | | . ... | |
249 | ____|____._________|______________|________.___ _|______________|__.________ | |
250 | |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-| | |
46f2e044 GX |
251 | (HEAD) (NONHEAD) (HEAD) (NONHEAD) . |
252 | . CBLKCNT . . | |
253 | . . . | |
254 | . . . | |
255 | _______._____________________________.______________._________________ | |
1b55767d GX |
256 | ... | | | | ... |
257 | _______|______________|______________|______________|_________________ | |
46f2e044 GX |
258 | |-> big pcluster <-|-> pcluster <-| |
259 | ||
260 | A physical cluster can be seen as a container of physical compressed blocks | |
261 | which contain compressed data. Previously, only lcluster-sized (4KB) pclusters | |
262 | were supported. With the big pcluster feature (available since Linux v5.13), | |
263 | a pcluster can be a multiple of the lcluster size. | |
264 | ||
265 | For each HEAD lcluster, clusterofs is recorded to indicate where a new extent | |
266 | starts and blkaddr is used to seek the compressed data. For each NONHEAD | |
267 | lcluster, delta0 and delta1 are available instead of blkaddr to indicate the | |
268 | distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is | |
269 | also a HEAD lcluster except that its data is uncompressed. See the comments | |
270 | around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. | |
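
The roles of clusterofs, blkaddr and delta0 can be sketched as a lookup from a file offset to the extent that contains it. The tuple-based index below is a hypothetical in-memory form, not the on-disk encoding, and big pcluster (CBLKCNT) is left out for simplicity.

```python
LCLUSTER_SIZE = 4096

# Hypothetical in-memory index, one entry per lcluster:
#   ("HEAD" or "PLAIN", clusterofs, blkaddr)
#   ("NONHEAD", delta0, delta1)
index = [
    ("HEAD", 0, 10),
    ("NONHEAD", 1, 1),
    ("HEAD", 1000, 11),
    ("NONHEAD", 1, 2),
    ("NONHEAD", 2, 1),
    ("PLAIN", 500, 12),
]

def find_extent(index, offset):
    """Return (logical start of the extent, blkaddr, is_compressed)."""
    lcn, ofs = divmod(offset, LCLUSTER_SIZE)
    kind, a, _ = index[lcn]
    if kind == "NONHEAD":
        lcn -= a                      # delta0: distance back to the HEAD
    elif ofs < a:
        # The offset lies before clusterofs, so it still belongs to the
        # previous extent: step back one lcluster and find its HEAD.
        lcn -= 1
        kind, a, _ = index[lcn]
        if kind == "NONHEAD":
            lcn -= a
    kind, clusterofs, blkaddr = index[lcn]
    return lcn * LCLUSTER_SIZE + clusterofs, blkaddr, kind != "PLAIN"

print(find_extent(index, 5000))
print(find_extent(index, 17000))
print(find_extent(index, 21000))
```
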
271 | ||
272 | If big pcluster is enabled, the pcluster size in lclusters needs to be recorded | |
273 | as well. The delta0 field of the first NONHEAD lcluster stores the compressed | |
274 | block count together with a special flag; such an lcluster is called a CBLKCNT | |
275 | NONHEAD lcluster. Since it directly follows its HEAD lcluster, its delta0 is | |
275 | implicitly always 1, as illustrated below:: | |
276 | ||
277 | __________________________________________________________ | |
278 | | HEAD | NONHEAD | NONHEAD | ... | NONHEAD | HEAD | HEAD | | |
279 | |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_| | |
280 | |<----- a big pcluster (with CBLKCNT) ------>|<-- -->| | |
281 | a lcluster-sized pcluster (without CBLKCNT) ^ | |
282 | ||
283 | If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT, | |
284 | but the size of such a pcluster is then known to be 1 lcluster. |
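
The CBLKCNT rule can be sketched as follows. The (kind, value) tuples are a hypothetical in-memory representation in which a CBLKCNT entry carries the compressed block count and a plain NONHEAD entry carries delta0. Neither the representation nor the function names come from the kernel sources.

```python
# Hypothetical per-lcluster entries, mirroring the diagram above:
#   ("HEAD", None), ("CBLKCNT", compressed block count), ("NONHEAD", delta0)
index = [
    ("HEAD", None),
    ("CBLKCNT", 2),     # first NONHEAD: stores the pcluster size instead of delta0
    ("NONHEAD", 2),
    ("NONHEAD", 3),
    ("HEAD", None),     # followed by another HEAD: no room for CBLKCNT
    ("HEAD", None),
]

def pcluster_size_in_lclusters(index, head_lcn):
    """Size of the pcluster that starts at the given HEAD lcluster."""
    if index[head_lcn][0] != "HEAD":
        raise ValueError("not a HEAD lcluster")
    nxt = head_lcn + 1
    if nxt < len(index) and index[nxt][0] == "CBLKCNT":
        return index[nxt][1]
    return 1            # no CBLKCNT recorded: lcluster-sized pcluster

def delta0(entry):
    """Distance back to the HEAD; implicitly 1 for a CBLKCNT entry."""
    kind, value = entry
    return 1 if kind == "CBLKCNT" else value

print([pcluster_size_in_lclusters(index, i) for i in (0, 4, 5)])
```
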