Commit | Line | Data |
---|---|---|
9eb425c0 PL |
1 | SQUASHFS 4.0 FILESYSTEM |
2 | ======================= | |
3 | ||
4 | Squashfs is a compressed read-only filesystem for Linux. | |
5 | It uses zlib compression to compress files, inodes and directories. | |
6 | Inodes in the system are very small and all blocks are packed to minimise | |
7 | data overhead. Block sizes greater than 4 KiB are supported up to a maximum | |
8 | of 1 MiB (default block size 128 KiB). | |
9 | ||
10 | Squashfs is intended for general read-only filesystem use, for archival | |
11 | use (i.e. in cases where a .tar.gz file may be used), and in constrained | |
12 | block device/memory systems (e.g. embedded systems) where low overhead is | |
13 | needed. | |
14 | ||
15 | Mailing list: squashfs-devel@lists.sourceforge.net | |
16 | Web site: www.squashfs.org | |
17 | ||
18 | 1. FILESYSTEM FEATURES | |
19 | ---------------------- | |
20 | ||
21 | Squashfs filesystem features versus Cramfs: | |
22 | ||
23 | Squashfs Cramfs | |
24 | ||
edf2e281 | 25 | Max filesystem size: 2^64 256 MiB |
9eb425c0 PL |
26 | Max file size: ~ 2 TiB 16 MiB |
27 | Max files: unlimited unlimited | |
28 | Max directories: unlimited unlimited | |
29 | Max entries per directory: unlimited unlimited | |
30 | Max block size: 1 MiB 4 KiB | |
31 | Metadata compression: yes no | |
32 | Directory indexes: yes no | |
33 | Sparse file support: yes no | |
34 | Tail-end packing (fragments): yes no | |
35 | Exportable (NFS etc.): yes no | |
36 | Hard link support: yes no | |
37 | "." and ".." in readdir: yes no | |
38 | Real inode numbers: yes no | |
39 | 32-bit uids/gids: yes no | |
40 | File creation time: yes no | |
41 | Xattr and ACL support: no no | |
42 | ||
43 | Squashfs compresses data, inodes and directories. In addition, inode and | |
44 | directory data are highly compacted, and packed on byte boundaries. Each | |
45 | compressed inode is on average 8 bytes in length (the exact length varies with | |
46 | file type, i.e. regular file, directory, symbolic link, and block/char device | |
47 | inodes have different sizes). | |
48 | ||
49 | 2. USING SQUASHFS | |
50 | ----------------- | |
51 | ||
52 | As squashfs is a read-only filesystem, the mksquashfs program must be used to | |
53 | create populated squashfs filesystems. This and other squashfs utilities | |
54 | can be obtained from http://www.squashfs.org. Usage instructions are also | |
55 | available from this site. | |
56 | ||
57 | ||
58 | 3. SQUASHFS FILESYSTEM DESIGN | |
59 | ----------------------------- | |
60 | ||
61 | A squashfs filesystem consists of seven parts, packed together on byte | |
62 | boundaries: | |
63 | ||
64 | --------------- | |
65 | | superblock | | |
66 | |---------------| | |
67 | | datablocks | | |
68 | | & fragments | | |
69 | |---------------| | |
70 | | inode table | | |
71 | |---------------| | |
72 | | directory | | |
73 | | table | | |
74 | |---------------| | |
75 | | fragment | | |
76 | | table | | |
77 | |---------------| | |
78 | | export | | |
79 | | table | | |
80 | |---------------| | |
81 | | uid/gid | | |
82 | | lookup table | | |
83 | --------------- | |
84 | ||
85 | Compressed data blocks are written to the filesystem as files are read from | |
86 | the source directory, and checked for duplicates. Once all file data has been | |
87 | written the completed inode, directory, fragment, export and uid/gid lookup | |
88 | tables are written. | |
89 | ||
90 | 3.1 Inodes | |
91 | ---------- | |
92 | ||
93 | Metadata (inodes and directories) are compressed in 8 KiB blocks. Each | |
94 | compressed block is prefixed by a two-byte length; the top bit is set if the | |
95 | block is uncompressed. A block will be uncompressed if the -noI option is set, | |
96 | or if the compressed block was larger than the uncompressed block. | |
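The two-byte prefix described above can be decoded as in the following minimal sketch. The function name is hypothetical, and the little-endian byte order is an assumption that this text does not state.

```python
import struct

# Top bit of the two-byte length; when set, the block is stored uncompressed.
UNCOMPRESSED_BIT = 1 << 15
LENGTH_MASK = 0x7FFF


def parse_metadata_header(raw: bytes):
    """Return (length, is_compressed) for a two-byte metadata block prefix.

    Assumption: the prefix is a little-endian unsigned 16-bit value.
    """
    (header,) = struct.unpack("<H", raw[:2])
    is_compressed = (header & UNCOMPRESSED_BIT) == 0
    length = header & LENGTH_MASK
    return length, is_compressed


print(parse_metadata_header(b"\x34\x12"))  # top bit clear: compressed block
print(parse_metadata_header(b"\x00\xa0"))  # top bit set: uncompressed block
```

A reader would then read `length` bytes after the prefix and decompress them only when `is_compressed` is true.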
97 | ||
98 | Inodes are packed into the metadata blocks, and are not aligned to block | |
99 | boundaries; an inode may therefore span compressed blocks. Inodes are identified | |
100 | by a 48-bit number which encodes the location of the compressed metadata block | |
101 | containing the inode, and the byte offset into that block where the inode is | |
102 | placed (<block, offset>). | |
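The <block, offset> encoding can be illustrated with a short sketch. The split used here (upper 32 bits for the block location, lower 16 bits for the offset) is an assumption; the text above only states that the number is 48 bits wide. The helper names are hypothetical.

```python
BLOCK_BITS = 32   # assumed width of the metadata block location
OFFSET_BITS = 16  # assumed width of the byte offset (an 8 KiB block needs only 13)


def pack_inode_ref(block: int, offset: int) -> int:
    """Combine a metadata block location and a byte offset into one number."""
    if not 0 <= block < (1 << BLOCK_BITS):
        raise ValueError("block location out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (block << OFFSET_BITS) | offset


def unpack_inode_ref(ref: int):
    """Split an inode number back into (block, offset)."""
    return ref >> OFFSET_BITS, ref & ((1 << OFFSET_BITS) - 1)


ref = pack_inode_ref(0x1234, 0x0042)
print(hex(ref), unpack_inode_ref(ref))
```

Because the offset refers to the decompressed block, locating an inode takes one block read, one decompression and one offset calculation.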
103 | ||
104 | To maximise compression there are different inodes for each file type | |
105 | (regular file, directory, device, etc.), the inode contents and length | |
106 | varying with the type. | |
107 | ||
108 | To further maximise compression, two types of regular file inode and | |
109 | directory inode are defined: inodes optimised for frequently occurring | |
110 | regular files and directories, and extended types where extra | |
111 | information has to be stored. | |
112 | ||
113 | 3.2 Directories | |
114 | --------------- | |
115 | ||
116 | Like inodes, directories are packed into compressed metadata blocks, stored | |
117 | in a directory table. Directories are accessed using the start address of | |
118 | the metablock containing the directory and the offset into the | |
119 | decompressed block (<block, offset>). | |
120 | ||
121 | Directories are organised in a slightly complex way, and are not simply | |
122 | a list of file names. The organisation takes advantage of the | |
123 | fact that (in most cases) the inodes of the files will be in the same | |
124 | compressed metadata block, and therefore, can share the start block. | |
125 | Directories are therefore organised in a two-level list: a directory | |
126 | header containing the shared start block value, and a sequence of directory | |
127 | entries, each of which uses that start block. A new directory header | |
128 | is written whenever the inode start block changes. The directory | |
129 | header/directory entry list is repeated as many times as necessary. | |
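The two-level organisation can be sketched as follows. This is an in-memory illustration only: the tuple layout and the function name are hypothetical and do not reflect the on-disk field layout.

```python
from itertools import groupby


def build_directory(entries):
    """Group sorted directory entries under shared-start-block headers.

    entries: list of (name, inode_start_block, inode_offset), sorted by name.
    Returns a list of (start_block, [(name, inode_offset), ...]) pairs, one
    pair per directory header.
    """
    listing = []
    # groupby only merges consecutive items, so a new header is started
    # every time the inode start block changes.
    for start_block, run in groupby(entries, key=lambda e: e[1]):
        listing.append((start_block, [(name, off) for name, _, off in run]))
    return listing


entries = [("a", 0, 0), ("b", 0, 32), ("c", 96, 8), ("d", 0, 64)]
for header in build_directory(entries):
    print(header)
```

In this example "d" gets its own header even though block 0 appeared earlier, because the start block changed in between.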
130 | ||
131 | Directories are sorted, and can contain a directory index to speed up | |
132 | file lookup. Directory indexes store one entry per metablock, each entry | |
133 | storing the index/filename mapping to the first directory header | |
134 | in each metadata block. Directories are sorted in alphabetical order, | |
135 | and at lookup the index is scanned linearly looking for the first filename | |
136 | alphabetically larger than the filename being looked up. At this point the | |
137 | location of the metadata block the filename is in has been found. | |
138 | The general idea of the index is to ensure only one metadata block needs to be | |
139 | decompressed to do a lookup irrespective of the length of the directory. | |
140 | This scheme has the advantage that it doesn't require extra memory overhead | |
141 | and doesn't require much extra storage on disk. | |
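The linear index scan can be sketched as below. The index representation (a list of (first filename, block location) pairs) and the names used are assumptions made for illustration.

```python
def find_block(index, dir_start, name):
    """Return the location of the metadata block that may contain `name`.

    index: list of (first_name, block_location), sorted by first_name, one
           entry per metadata block (hypothetical in-memory form).
    dir_start: location of the directory's first metadata block, used when
               `name` sorts before every index entry.
    """
    location = dir_start
    for first_name, block in index:
        if first_name > name:
            # First filename larger than the target: the previous block
            # is the only one that can hold the entry.
            break
        location = block
    return location


index = [("g", 100), ("n", 200), ("t", 300)]
for target in ("apple", "kiwi", "n", "zebra"):
    print(target, find_block(index, 0, target))
```

Only the block returned here has to be decompressed and searched, however long the directory is.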
142 | ||
143 | 3.3 File data | |
144 | ------------- | |
145 | ||
146 | Regular files consist of a sequence of contiguous compressed blocks, and/or a | |
147 | compressed fragment block (tail-end packed block). The compressed size | |
148 | of each datablock is stored in a block list contained within the | |
149 | file inode. | |
150 | ||
151 | To speed up access to datablocks when reading 'large' files (256 Mbytes or | |
152 | larger), the code implements an index cache that caches the mapping from | |
153 | block index to datablock location on disk. | |
154 | ||
155 | The index cache allows Squashfs to handle large files (up to 1.75 TiB) while | |
156 | retaining a simple and space-efficient block list on disk. The cache | |
157 | is split into slots, caching up to eight 224 GiB files (128 KiB blocks). | |
158 | Larger files use multiple slots, with 1.75 TiB files using all 8 slots. | |
159 | The index cache is designed to be memory efficient, and by default uses | |
160 | 16 KiB. | |
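The figures quoted above are mutually consistent, as this small arithmetic check shows. It uses only the numbers given in the text; the constant names are illustrative.

```python
KIB = 1 << 10
GIB = 1 << 30
TIB = 1 << 40

SLOTS = 8                    # number of index cache slots
BYTES_PER_SLOT = 224 * GIB   # file data covered by one slot
BLOCK_SIZE = 128 * KIB       # default block size

max_cached_file = SLOTS * BYTES_PER_SLOT
blocks_per_slot = BYTES_PER_SLOT // BLOCK_SIZE

print(max_cached_file / TIB)  # file size covered when all slots are used, in TiB
print(blocks_per_slot)        # datablocks covered by a single slot
```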
161 | ||
162 | 3.4 Fragment lookup table | |
163 | ------------------------- | |
164 | ||
165 | Regular files can contain a fragment index which is mapped to a fragment | |
166 | location on disk and compressed size using a fragment lookup table. This | |
167 | fragment lookup table is itself stored compressed into metadata blocks. | |
168 | A second index table is used to locate these. For speed of access (and | |
169 | because it is small), this second index table is read at mount time and cached | |
170 | in memory. | |
171 | ||
172 | 3.5 Uid/gid lookup table | |
173 | ------------------------ | |
174 | ||
175 | For space efficiency regular files store uid and gid indexes, which are | |
176 | converted to 32-bit uids/gids using an id lookup table. This table is | |
177 | stored compressed into metadata blocks. A second index table is used to | |
178 | locate these. For speed of access (and because it is small), this second | |
179 | index table is read at mount time and cached in memory. | |
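The two-level lookup can be sketched as follows, assuming ids are packed back to back as 4-byte values in 8 KiB metadata blocks. The packing is an assumption and the function name is hypothetical.

```python
METADATA_SIZE = 8192  # uncompressed metadata block size in bytes
ID_SIZE = 4           # a 32-bit uid/gid
IDS_PER_BLOCK = METADATA_SIZE // ID_SIZE


def locate_id(index: int):
    """Map an id index to (index_table_entry, byte_offset_in_block).

    The first value selects an entry in the second index table cached at
    mount time, which gives the on-disk location of a metadata block. The
    second value is the byte offset of the id inside that block once it
    has been decompressed.
    """
    block_number, slot = divmod(index, IDS_PER_BLOCK)
    return block_number, slot * ID_SIZE


print(IDS_PER_BLOCK)
print(locate_id(0))
print(locate_id(2049))
```

The fragment and export tables are described as using the same arrangement of metadata blocks plus a second index table, so the same calculation would apply with a different entry size.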
180 | ||
181 | 3.6 Export table | |
182 | ---------------- | |
183 | ||
184 | To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems | |
185 | can optionally (disabled with the -no-exports Mksquashfs option) contain | |
186 | an inode number to inode disk location lookup table. This is required to | |
187 | enable Squashfs to map inode numbers passed in filehandles to the inode | |
188 | location on disk, which is necessary when the export code reinstantiates | |
189 | expired/flushed inodes. | |
190 | ||
191 | This table is stored compressed into metadata blocks. A second index table is | |
192 | used to locate these. For speed of access (and because it is small), this | |
193 | second index table is read at mount time and cached in memory. | |
194 | ||
195 | ||
196 | 4. TODOS AND OUTSTANDING ISSUES | |
197 | ------------------------------- | |
198 | ||
199 | 4.1 Todo list | |
200 | ------------- | |
201 | ||
202 | Implement Xattr and ACL support. The Squashfs 4.0 filesystem layout has hooks | |
203 | for these but the code has not been written. Once the code has been written | |
204 | the existing layout should not require modification. | |
205 | ||
206 | 4.2 Squashfs internal cache | |
207 | --------------------------- | |
208 | ||
209 | Blocks in Squashfs are compressed. To avoid repeatedly decompressing | |
210 | recently accessed data, Squashfs uses two small caches (metadata and fragment). | |
211 | ||
212 | The cache is not used for file datablocks; these are decompressed and cached in | |
213 | the page-cache in the normal way. The cache is used to temporarily cache | |
214 | fragment and metadata blocks which have been read as a result of a metadata | |
215 | (i.e. inode or directory) or fragment access. Because metadata and fragments | |
216 | are packed together into blocks (to gain greater compression) the read of a | |
217 | particular piece of metadata or fragment will retrieve other metadata/fragments | |
218 | which have been packed with it; because of locality of reference, these may be | |
219 | read in the near future. Temporarily caching them ensures they are available | |
220 | for near future access without requiring an additional read and decompress. | |
221 | ||
222 | In the future this internal cache may be replaced with an implementation which | |
223 | uses the kernel page cache. Because the page cache operates on page sized | |
224 | units this may introduce additional complexity in terms of locking and | |
225 | associated race conditions. |