]>
Commit | Line | Data |
---|---|---|
cbf8f0f3 PE |
1 | RCU-based dcache locking model |
2 | ============================== | |
3 | ||
4 | On many workloads, the most common operation on dcache is to look up a | |
5 | dentry, given a parent dentry and the name of the child. Typically, | |
6 | for every open(), stat() etc., the dentry corresponding to the | |
7 | pathname will be looked up by walking the tree starting with the first | |
8 | component of the pathname and using that dentry along with the next | |
9 | component to look up the next level and so on. Since it is a frequent | |
10 | operation for workloads like multiuser environments and web servers, | |
11 | it is important to optimize this path. | |
12 | ||
13 | Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in | |
14 | every component during path look-up. Since 2.5.10 onwards, fast-walk | |
15 | algorithm changed this by holding the dcache_lock at the beginning and | |
16 | walking as many cached path component dentries as possible. This | |
17 | significantly decreases the number of acquisition of | |
18 | dcache_lock. However it also increases the lock hold time | |
19 | significantly and affects performance in large SMP machines. Since | |
20 | 2.5.62 kernel, dcache has been using a new locking model that uses RCU | |
21 | to make dcache look-up lock-free. | |
22 | ||
23 | The current dcache locking model is not very different from the | |
24 | existing dcache locking model. Prior to 2.5.62 kernel, dcache_lock | |
25 | protected the hash chain, d_child, d_alias, d_lru lists as well as | |
26 | d_inode and several other things like mount look-up. RCU-based changes | |
27 | affect only the way the hash chain is protected. For everything else | |
28 | the dcache_lock must be taken for both traversing as well as | |
29 | updating. The hash chain updates too take the dcache_lock. The | |
30 | significant change is the way d_lookup traverses the hash chain, it | |
31 | doesn't acquire the dcache_lock for this and rely on RCU to ensure | |
32 | that the dentry has not been *freed*. | |
33 | ||
b5c84bf6 | 34 | dcache_lock no longer exists, dentry locking is explained in fs/dcache.c |
cbf8f0f3 PE |
35 | |
36 | Dcache locking details | |
37 | ====================== | |
38 | ||
39 | For many multi-user workloads, open() and stat() on files are very | |
40 | frequently occurring operations. Both involve walking of path names to | |
41 | find the dentry corresponding to the concerned file. In 2.4 kernel, | |
42 | dcache_lock was held during look-up of each path component. Contention | |
43 | and cache-line bouncing of this global lock caused significant | |
44 | scalability problems. With the introduction of RCU in Linux kernel, | |
45 | this was worked around by making the look-up of path components during | |
46 | path walking lock-free. | |
47 | ||
48 | ||
49 | Safe lock-free look-up of dcache hash table | |
50 | =========================================== | |
51 | ||
52 | Dcache is a complex data structure with the hash table entries also | |
53 | linked together in other lists. In 2.4 kernel, dcache_lock protected | |
b5c84bf6 | 54 | all the lists. RCU dentry hash walking works like this: |
cbf8f0f3 PE |
55 | |
56 | 1. The deletion from hash chain is done using hlist_del_rcu() macro | |
57 | which doesn't initialize next pointer of the deleted dentry and | |
58 | this allows us to walk safely lock-free while a deletion is | |
b5c84bf6 | 59 | happening. This is a standard hlist_rcu iteration. |
cbf8f0f3 PE |
60 | |
61 | 2. Insertion of a dentry into the hash table is done using | |
62 | hlist_add_head_rcu() which take care of ordering the writes - the | |
63 | writes to the dentry must be visible before the dentry is | |
4c54005c PM |
64 | inserted. This works in conjunction with hlist_for_each_rcu(), |
65 | which has since been replaced by hlist_for_each_entry_rcu(), while | |
cbf8f0f3 PE |
66 | walking the hash chain. The only requirement is that all |
67 | initialization to the dentry must be done before | |
b5c84bf6 NP |
68 | hlist_add_head_rcu() since we don't have lock protection |
69 | while traversing the hash chain. | |
cbf8f0f3 | 70 | |
b5c84bf6 NP |
71 | 3. The dentry looked up without holding locks cannot be returned for |
72 | walking if it is unhashed. It then may have a NULL d_inode or other | |
73 | bogosity since RCU doesn't protect the other fields in the dentry. We | |
74 | therefore use a flag DCACHE_UNHASHED to indicate unhashed dentries | |
75 | and use this in conjunction with a per-dentry lock (d_lock). Once | |
76 | looked up without locks, we acquire the per-dentry lock (d_lock) and | |
77 | check if the dentry is unhashed. If so, the look-up is failed. If not, | |
78 | the reference count of the dentry is increased and the dentry is | |
79 | returned. | |
cbf8f0f3 PE |
80 | |
81 | 4. Once a dentry is looked up, it must be ensured during the path walk | |
82 | for that component it doesn't go away. In pre-2.5.10 code, this was | |
83 | done holding a reference to the dentry. dcache_rcu does the same. | |
84 | In some sense, dcache_rcu path walking looks like the pre-2.5.10 | |
85 | version. | |
86 | ||
b5c84bf6 NP |
87 | 5. All dentry hash chain updates must take the per-dentry lock (see |
88 | fs/dcache.c). This excludes dput() to ensure that a dentry that has | |
89 | been looked up concurrently does not get deleted before dget() can | |
90 | take a ref. | |
cbf8f0f3 PE |
91 | |
92 | 6. There are several ways to do reference counting of RCU protected | |
93 | objects. One such example is in ipv4 route cache where deferred | |
94 | freeing (using call_rcu()) is done as soon as the reference count | |
95 | goes to zero. This cannot be done in the case of dentries because | |
96 | tearing down of dentries require blocking (dentry_iput()) which | |
97 | isn't supported from RCU callbacks. Instead, tearing down of | |
98 | dentries happen synchronously in dput(), but actual freeing happens | |
99 | later when RCU grace period is over. This allows safe lock-free | |
100 | walking of the hash chains, but a matched dentry may have been | |
101 | partially torn down. The checking of DCACHE_UNHASHED flag with | |
102 | d_lock held detects such dentries and prevents them from being | |
103 | returned from look-up. | |
104 | ||
105 | ||
106 | Maintaining POSIX rename semantics | |
107 | ================================== | |
108 | ||
109 | Since look-up of dentries is lock-free, it can race against a | |
110 | concurrent rename operation. For example, during rename of file A to | |
111 | B, look-up of either A or B must succeed. So, if look-up of B happens | |
112 | after A has been removed from the hash chain but not added to the new | |
113 | hash chain, it may fail. Also, a comparison while the name is being | |
114 | written concurrently by a rename may result in false positive matches | |
115 | violating rename semantics. Issues related to race with rename are | |
116 | handled as described below : | |
117 | ||
118 | 1. Look-up can be done in two ways - d_lookup() which is safe from | |
119 | simultaneous renames and __d_lookup() which is not. If | |
120 | __d_lookup() fails, it must be followed up by a d_lookup() to | |
121 | correctly determine whether a dentry is in the hash table or | |
122 | not. d_lookup() protects look-ups using a sequence lock | |
123 | (rename_lock). | |
124 | ||
125 | 2. The name associated with a dentry (d_name) may be changed if a | |
126 | rename is allowed to happen simultaneously. To avoid memcmp() in | |
127 | __d_lookup() go out of bounds due to a rename and false positive | |
128 | comparison, the name comparison is done while holding the | |
129 | per-dentry lock. This prevents concurrent renames during this | |
130 | operation. | |
131 | ||
132 | 3. Hash table walking during look-up may move to a different bucket as | |
133 | the current dentry is moved to a different bucket due to rename. | |
134 | But we use hlists in dcache hash table and they are | |
135 | null-terminated. So, even if a dentry moves to a different bucket, | |
136 | hash chain walk will terminate. [with a list_head list, it may not | |
137 | since termination is when the list_head in the original bucket is | |
138 | reached]. Since we redo the d_parent check and compare name while | |
139 | holding d_lock, lock-free look-up will not race against d_move(). | |
140 | ||
141 | 4. There can be a theoretical race when a dentry keeps coming back to | |
142 | original bucket due to double moves. Due to this look-up may | |
143 | consider that it has never moved and can end up in a infinite loop. | |
144 | But this is not any worse that theoretical livelocks we already | |
145 | have in the kernel. | |
146 | ||
147 | ||
148 | Important guidelines for filesystem developers related to dcache_rcu | |
149 | ==================================================================== | |
150 | ||
151 | 1. Existing dcache interfaces (pre-2.5.62) exported to filesystem | |
152 | don't change. Only dcache internal implementation changes. However | |
153 | filesystems *must not* delete from the dentry hash chains directly | |
154 | using the list macros like allowed earlier. They must use dcache | |
155 | APIs like d_drop() or __d_drop() depending on the situation. | |
156 | ||
157 | 2. d_flags is now protected by a per-dentry lock (d_lock). All access | |
158 | to d_flags must be protected by it. | |
159 | ||
160 | 3. For a hashed dentry, checking of d_count needs to be protected by | |
161 | d_lock. | |
162 | ||
163 | ||
164 | Papers and other documentation on dcache locking | |
165 | ================================================ | |
166 | ||
167 | 1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). | |
168 | ||
169 | 2. http://lse.sourceforge.net/locking/dcache/dcache.html | |
170 | ||
171 | ||
172 |