Commit | Line | Data |
---|---|---|
1da177e4 LT |
1 | |
2 | Making Filesystems Exportable | |
3 | ============================= | |
4 | ||
5 | Most filesystem operations require a dentry (or two) as a starting | |
6 | point. Local applications have a reference-counted hold on suitable | |
7 | dentries via open file descriptors or cwd/root. However, remote | |
8 | applications that access a filesystem via a remote filesystem protocol | |
9 | such as NFS may not be able to hold such a reference, and so need a | |
10 | different way to refer to a particular dentry. As the alternative | |
11 | form of reference needs to be stable across renames, truncates, and | |
12 | server-reboot (among other things, though these tend to be the most | |
13 | problematic), there is no simple answer like 'filename'. | |
14 | ||
15 | The mechanism discussed here allows each filesystem implementation to | |
16 | specify how to generate an opaque (outside of the filesystem) byte | |
17 | string for any dentry, and how to find an appropriate dentry for any | |
18 | given opaque byte string. | |
19 | This byte string will be called a "filehandle fragment" as it | |
20 | corresponds to part of an NFS filehandle. | |
21 | ||
22 | A filesystem which supports the mapping between filehandle fragments | |
23 | and dentries will be termed "exportable". | |
24 | ||
25 | ||
26 | ||
27 | Dcache Issues | |
28 | ------------- | |
29 | ||
30 | The dcache normally contains a proper prefix of any given filesystem | |
31 | tree. This means that if any filesystem object is in the dcache, then | |
32 | all of the ancestors of that filesystem object are also in the dcache. | |
33 | As normal access is by filename, this prefix is created naturally and | |
34 | maintained easily (by each object maintaining a reference count on | |
35 | its parent). | |
36 | ||
37 | However, when objects are added to the dcache by interpreting a | |
38 | filehandle fragment, there is no automatic creation of a path prefix | |
39 | for the object. This leads to two related but distinct features of | |
40 | the dcache that are not needed for normal filesystem access. | |
41 | ||
42 | 1/ The dcache must sometimes contain objects that are not part of the | |
43 | proper prefix, i.e. that are not connected to the root. | |
44 | 2/ The dcache must be prepared for a newly found (via ->lookup) directory | |
45 | to already have a (non-connected) dentry, and must be able to move | |
46 | that dentry into place (based on the parent and name in the | |
47 | ->lookup). This is particularly needed for directories as | |
48 | it is a dcache invariant that directories only have one dentry. | |
49 | ||
50 | To implement these features, the dcache has: | |
51 | ||
52 | a/ A dentry flag DCACHE_DISCONNECTED which is set on | |
53 | any dentry that might not be part of the proper prefix. | |
54 | This is set when anonymous dentries are created, and cleared when a | |
55 | dentry is noticed to be a child of a dentry which is in the proper | |
56 | prefix. | |
57 | ||
58 | b/ A per-superblock list "s_anon" of dentries which are the roots of | |
59 | subtrees that are not in the proper prefix. These dentries, as | |
60 | well as the proper prefix, need to be released at unmount time. As | |
61 | these dentries will not be hashed, they are linked together on the | |
62 | d_hash list_head. | |
63 | ||
64 | c/ Helper routines to allocate anonymous dentries, and to help attach | |
65 | loose directory dentries at lookup time. They are: | |
66 | d_alloc_anon(inode) will return a dentry for the given inode. | |
67 | If the inode already has a dentry, one of those is returned. | |
68 | If it doesn't, a new anonymous (IS_ROOT and | |
69 | DCACHE_DISCONNECTED) dentry is allocated and attached. | |
70 | In the case of a directory, care is taken that only one dentry | |
71 | can ever be attached. | |
72 | d_splice_alias(inode, dentry) will make sure that there is a | |
73 | dentry with the same name and parent as the given dentry, and | |
74 | which refers to the given inode. | |
75 | If the inode is a directory and already has a dentry, then that | |
76 | dentry is d_moved over the given dentry. | |
77 | If the passed dentry gets attached, care is taken that this is | |
78 | mutually exclusive to a d_alloc_anon operation. | |
79 | If the passed dentry is used, NULL is returned, else the used | |
80 | dentry is returned. This corresponds to the calling pattern of | |
81 | ->lookup. | |
82 | ||
83 | ||
84 | Filesystem Issues | |
85 | ----------------- | |
86 | ||
87 | For a filesystem to be exportable it must: | |
88 | ||
89 | 1/ provide the filehandle fragment routines described below. | |
90 | 2/ make sure that d_splice_alias is used rather than d_add | |
91 | when ->lookup finds an inode for a given parent and name. | |
92 | Typically the ->lookup routine will end: | |
93 | if (inode) | |
94 | return d_splice_alias(inode, dentry); | |
95 | d_add(dentry, inode); | |
96 | return NULL; | |
97 | } | |
98 | ||
99 | ||
100 | ||
101 | A filesystem implementation declares that instances of the filesystem | |
102 | are exportable by setting the s_export_op field in the struct | |
103 | super_block. This field must point to a "struct export_operations" | |
104 | which could potentially be full of NULLs, though normally at | |
105 | least get_parent will be set. | |
106 | ||
107 | The primary operations are decode_fh and encode_fh. | |
108 | decode_fh takes a filehandle fragment and tries to find or create a | |
109 | dentry for the object referred to by the filehandle. | |
110 | encode_fh takes a dentry and creates a filehandle fragment which can | |
111 | later be used to find/create a dentry for the same object. | |
112 | ||
113 | decode_fh will probably make use of "find_exported_dentry". | |
114 | This function lives in the "exportfs" module which a filesystem does | |
115 | not need unless it is being exported. So rather than calling | |
116 | find_exported_dentry directly, each filesystem should call it through | |
117 | the find_exported_dentry pointer in its export_operations table. | |
118 | This field is set correctly by the exporting agent (e.g. nfsd) when a | |
119 | filesystem is exported, and before any export operations are called. | |
120 | ||
121 | find_exported_dentry needs three support functions from the | |
122 | filesystem: | |
123 | get_name. When given a parent dentry and a child dentry, this | |
124 | should find a name in the directory identified by the parent | |
125 | dentry, which leads to the object identified by the child dentry. | |
126 | If no get_name function is supplied, a default implementation is | |
127 | provided which uses vfs_readdir to find potential names, and | |
128 | matches inode numbers to find the correct match. | |
129 | ||
130 | get_parent. When given a dentry for a directory, this should return | |
131 | a dentry for the parent. Quite possibly the parent dentry will | |
132 | have been allocated by d_alloc_anon. | |
133 | The default get_parent function just returns an error so any | |
134 | filehandle lookup that requires finding a parent will fail. | |
135 | ->lookup("..") is *not* used as a default as it can leave ".." | |
136 | entries in the dcache which are too messy to work with. | |
137 | ||
138 | get_dentry. When given an opaque datum, this should find the | |
139 | implied object and create a dentry for it (possibly with | |
140 | d_alloc_anon). | |
141 | The opaque datum is whatever is passed down by the decode_fh | |
142 | function, and is often simply a fragment of the filehandle | |
143 | fragment. | |
144 | decode_fh passes two datums through find_exported_dentry: one that | |
145 | should be used to identify the target object, and one that can be | |
146 | used to identify the object's parent, should that be necessary. | |
147 | The default get_dentry function assumes that the datum contains an | |
148 | inode number and a generation number, and it attempts to get the | |
149 | inode using "iget" and check its validity by matching the | |
150 | generation number. A filesystem should only depend on the default | |
151 | if iget can safely be used this way. | |
152 | ||
153 | If decode_fh and/or encode_fh are left as NULL, then default | |
154 | implementations are used. These defaults are suitable for ext2 and | |
155 | extremely similar filesystems (like ext3). | |
156 | ||
157 | The default encode_fh creates a filehandle fragment from the inode | |
158 | number and generation number of the target together with the inode | |
159 | number and generation number of the parent (if the parent is | |
160 | required). | |
161 | ||
162 | The default decode_fh extracts the target and parent datums from the | |
163 | filehandle assuming the format used by the default encode_fh and | |
164 | passes them to find_exported_dentry. | |
165 | ||
166 | ||
167 | A filehandle fragment consists of an array of 1 or more 4-byte words, | |
168 | together with a one-byte "type". | |
169 | The decode_fh routine should not depend on the stated size that is | |
170 | passed to it. This size may be larger than the original filehandle | |
171 | generated by encode_fh, in which case it will have been padded with | |
172 | NULs. Rather, the encode_fh routine should choose a "type" which | |
173 | indicates to decode_fh how much of the filehandle is valid, and how | |
174 | it should be interpreted. | |
174 | it should be interpreted. | |
175 | ||
176 |
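The relationship between the "type" and the valid length of a filehandle
fragment can be illustrated with a small user-space sketch. It is not the
kernel implementation: the names sketch_encode_fh, sketch_decode_fh and
struct obj_id, and the type values 1, 2 and 255, are assumptions made for
illustration only. The layout (inode number and generation number of the
target, optionally followed by those of the parent) follows the description
of the default encode_fh above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type codes: the type tells the decoder how many words are valid. */
enum {
	FH_TYPE_INO_GEN = 1,        /* 2 words: target ino, target generation */
	FH_TYPE_INO_GEN_PARENT = 2, /* 4 words: target pair, then parent pair */
	FH_TYPE_INVALID = 255       /* buffer too small to encode into */
};

/* Hypothetical stand-in for an object's identity. */
struct obj_id {
	uint32_t ino;
	uint32_t gen;
};

/*
 * Encode the target (and the parent, if non-NULL) into fh[].
 * *max_len is the capacity in 4-byte words on entry and the number of
 * words actually used on return. Returns the chosen type.
 */
static int sketch_encode_fh(const struct obj_id *target,
			    const struct obj_id *parent,
			    uint32_t *fh, int *max_len)
{
	int need = parent ? 4 : 2;

	assert(target && fh && max_len);
	if (*max_len < need)
		return FH_TYPE_INVALID;

	fh[0] = target->ino;
	fh[1] = target->gen;
	if (parent) {
		fh[2] = parent->ino;
		fh[3] = parent->gen;
	}
	*max_len = need;
	return parent ? FH_TYPE_INO_GEN_PARENT : FH_TYPE_INO_GEN;
}

/*
 * Decode a fragment. The number of valid words is derived from the type,
 * not from fh_len, because fh_len may include NUL padding. fh_len is used
 * only as a defensive lower bound. Returns 0 on success, -1 otherwise.
 */
static int sketch_decode_fh(const uint32_t *fh, int fh_len, int fh_type,
			    struct obj_id *target, struct obj_id *parent,
			    int *has_parent)
{
	int valid;

	switch (fh_type) {
	case FH_TYPE_INO_GEN:
		valid = 2;
		break;
	case FH_TYPE_INO_GEN_PARENT:
		valid = 4;
		break;
	default:
		return -1; /* unknown type: cannot interpret the words */
	}
	if (fh_len < valid)
		return -1;

	target->ino = fh[0];
	target->gen = fh[1];
	*has_parent = (valid == 4);
	if (*has_parent) {
		parent->ino = fh[2];
		parent->gen = fh[3];
	}
	return 0;
}
```

In this sketch a fragment encoded into 4 words still decodes correctly when
it is handed back with a stated size of, say, 8 words, because the decoder
takes the valid length from the type rather than from the stated size.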