]>
Commit | Line | Data |
---|---|---|
649e4368 PM |
1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" |
2 | "http://www.w3.org/TR/html4/loose.dtd"> | |
3 | <html> | |
4 | <head><title>A Tour Through RCU's Requirements [LWN.net]</title> | |
5 | <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"> | |
6 | ||
7 | <h1>A Tour Through RCU's Requirements</h1> | |
8 | ||
9 | <p>Copyright IBM Corporation, 2015</p> | |
10 | <p>Author: Paul E. McKenney</p> | |
11 | <p><i>The initial version of this document appeared in the | |
12 | <a href="https://lwn.net/">LWN</a> articles | |
13 | <a href="https://lwn.net/Articles/652156/">here</a>, | |
14 | <a href="https://lwn.net/Articles/652677/">here</a>, and | |
15 | <a href="https://lwn.net/Articles/653326/">here</a>.</i></p> | |
16 | ||
17 | <h2>Introduction</h2> | |
18 | ||
19 | <p> | |
20 | Read-copy update (RCU) is a synchronization mechanism that is often | |
21 | used as a replacement for reader-writer locking. | |
22 | RCU is unusual in that updaters do not block readers, | |
23 | which means that RCU's read-side primitives can be exceedingly fast | |
24 | and scalable. | |
25 | In addition, updaters can make useful forward progress concurrently | |
26 | with readers. | |
27 | However, all this concurrency between RCU readers and updaters does raise | |
28 | the question of exactly what RCU readers are doing, which in turn | |
29 | raises the question of exactly what RCU's requirements are. | |
30 | ||
31 | <p> | |
32 | This document therefore summarizes RCU's requirements, and can be thought | |
33 | of as an informal, high-level specification for RCU. | |
34 | It is important to understand that RCU's specification is primarily | |
35 | empirical in nature; | |
36 | in fact, I learned about many of these requirements the hard way. | |
37 | This situation might cause some consternation, however, not only | |
38 | has this learning process been a lot of fun, but it has also been | |
39 | a great privilege to work with so many people willing to apply | |
40 | technologies in interesting new ways. | |
41 | ||
42 | <p> | |
43 | All that aside, here are the categories of currently known RCU requirements: | |
44 | </p> | |
45 | ||
46 | <ol> | |
47 | <li> <a href="#Fundamental Requirements"> | |
48 | Fundamental Requirements</a> | |
49 | <li> <a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a> | |
50 | <li> <a href="#Parallelism Facts of Life"> | |
51 | Parallelism Facts of Life</a> | |
52 | <li> <a href="#Quality-of-Implementation Requirements"> | |
53 | Quality-of-Implementation Requirements</a> | |
54 | <li> <a href="#Linux Kernel Complications"> | |
55 | Linux Kernel Complications</a> | |
56 | <li> <a href="#Software-Engineering Requirements"> | |
57 | Software-Engineering Requirements</a> | |
58 | <li> <a href="#Other RCU Flavors"> | |
59 | Other RCU Flavors</a> | |
60 | <li> <a href="#Possible Future Changes"> | |
61 | Possible Future Changes</a> | |
62 | </ol> | |
63 | ||
64 | <p> | |
65 | This is followed by a <a href="#Summary">summary</a>, | |
6146f8df PM |
66 | however, the answers to each quick quiz immediately follows the quiz. |
67 | Select the big white space with your mouse to see the answer. | |
649e4368 PM |
68 | |
69 | <h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2> | |
70 | ||
71 | <p> | |
72 | RCU's fundamental requirements are the closest thing RCU has to hard | |
73 | mathematical requirements. | |
74 | These are: | |
75 | ||
76 | <ol> | |
77 | <li> <a href="#Grace-Period Guarantee"> | |
78 | Grace-Period Guarantee</a> | |
79 | <li> <a href="#Publish-Subscribe Guarantee"> | |
80 | Publish-Subscribe Guarantee</a> | |
4b689330 PM |
81 | <li> <a href="#Memory-Barrier Guarantees"> |
82 | Memory-Barrier Guarantees</a> | |
649e4368 PM |
83 | <li> <a href="#RCU Primitives Guaranteed to Execute Unconditionally"> |
84 | RCU Primitives Guaranteed to Execute Unconditionally</a> | |
85 | <li> <a href="#Guaranteed Read-to-Write Upgrade"> | |
86 | Guaranteed Read-to-Write Upgrade</a> | |
87 | </ol> | |
88 | ||
89 | <h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3> | |
90 | ||
91 | <p> | |
92 | RCU's grace-period guarantee is unusual in being premeditated: | |
93 | Jack Slingwine and I had this guarantee firmly in mind when we started | |
94 | work on RCU (then called “rclock”) in the early 1990s. | |
95 | That said, the past two decades of experience with RCU have produced | |
96 | a much more detailed understanding of this guarantee. | |
97 | ||
98 | <p> | |
99 | RCU's grace-period guarantee allows updaters to wait for the completion | |
100 | of all pre-existing RCU read-side critical sections. | |
101 | An RCU read-side critical section | |
102 | begins with the marker <tt>rcu_read_lock()</tt> and ends with | |
103 | the marker <tt>rcu_read_unlock()</tt>. | |
104 | These markers may be nested, and RCU treats a nested set as one | |
105 | big RCU read-side critical section. | |
106 | Production-quality implementations of <tt>rcu_read_lock()</tt> and | |
107 | <tt>rcu_read_unlock()</tt> are extremely lightweight, and in | |
108 | fact have exactly zero overhead in Linux kernels built for production | |
109 | use with <tt>CONFIG_PREEMPT=n</tt>. | |
110 | ||
111 | <p> | |
112 | This guarantee allows ordering to be enforced with extremely low | |
113 | overhead to readers, for example: | |
114 | ||
115 | <blockquote> | |
116 | <pre> | |
117 | 1 int x, y; | |
118 | 2 | |
119 | 3 void thread0(void) | |
120 | 4 { | |
121 | 5 rcu_read_lock(); | |
122 | 6 r1 = READ_ONCE(x); | |
123 | 7 r2 = READ_ONCE(y); | |
124 | 8 rcu_read_unlock(); | |
125 | 9 } | |
126 | 10 | |
127 | 11 void thread1(void) | |
128 | 12 { | |
129 | 13 WRITE_ONCE(x, 1); | |
130 | 14 synchronize_rcu(); | |
131 | 15 WRITE_ONCE(y, 1); | |
132 | 16 } | |
133 | </pre> | |
134 | </blockquote> | |
135 | ||
136 | <p> | |
137 | Because the <tt>synchronize_rcu()</tt> on line 14 waits for | |
138 | all pre-existing readers, any instance of <tt>thread0()</tt> that | |
139 | loads a value of zero from <tt>x</tt> must complete before | |
140 | <tt>thread1()</tt> stores to <tt>y</tt>, so that instance must | |
141 | also load a value of zero from <tt>y</tt>. | |
142 | Similarly, any instance of <tt>thread0()</tt> that loads a value of | |
143 | one from <tt>y</tt> must have started after the | |
144 | <tt>synchronize_rcu()</tt> started, and must therefore also load | |
145 | a value of one from <tt>x</tt>. | |
146 | Therefore, the outcome: | |
147 | <blockquote> | |
148 | <pre> | |
149 | (r1 == 0 && r2 == 1) | |
150 | </pre> | |
151 | </blockquote> | |
152 | cannot happen. | |
153 | ||
6146f8df PM |
154 | <table> |
155 | <tr><th> </th></tr> | |
156 | <tr><th align="left">Quick Quiz:</th></tr> | |
157 | <tr><td> | |
158 | Wait a minute! | |
159 | You said that updaters can make useful forward progress concurrently | |
160 | with readers, but pre-existing readers will block | |
161 | <tt>synchronize_rcu()</tt>!!! | |
162 | Just who are you trying to fool??? | |
163 | </td></tr> | |
164 | <tr><th align="left">Answer:</th></tr> | |
165 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
166 | First, if updaters do not wish to be blocked by readers, they can use | |
167 | <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will | |
168 | be discussed later. | |
169 | Second, even when using <tt>synchronize_rcu()</tt>, the other | |
170 | update-side code does run concurrently with readers, whether | |
171 | pre-existing or not. | |
172 | </font></td></tr> | |
173 | <tr><td> </td></tr> | |
174 | </table> | |
649e4368 PM |
175 | |
176 | <p> | |
177 | This scenario resembles one of the first uses of RCU in | |
178 | <a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>, | |
179 | which managed a distributed lock manager's transition into | |
180 | a state suitable for handling recovery from node failure, | |
181 | more or less as follows: | |
182 | ||
183 | <blockquote> | |
184 | <pre> | |
185 | 1 #define STATE_NORMAL 0 | |
186 | 2 #define STATE_WANT_RECOVERY 1 | |
187 | 3 #define STATE_RECOVERING 2 | |
188 | 4 #define STATE_WANT_NORMAL 3 | |
189 | 5 | |
190 | 6 int state = STATE_NORMAL; | |
191 | 7 | |
192 | 8 void do_something_dlm(void) | |
193 | 9 { | |
194 | 10 int state_snap; | |
195 | 11 | |
196 | 12 rcu_read_lock(); | |
197 | 13 state_snap = READ_ONCE(state); | |
198 | 14 if (state_snap == STATE_NORMAL) | |
199 | 15 do_something(); | |
200 | 16 else | |
201 | 17 do_something_carefully(); | |
202 | 18 rcu_read_unlock(); | |
203 | 19 } | |
204 | 20 | |
205 | 21 void start_recovery(void) | |
206 | 22 { | |
207 | 23 WRITE_ONCE(state, STATE_WANT_RECOVERY); | |
208 | 24 synchronize_rcu(); | |
209 | 25 WRITE_ONCE(state, STATE_RECOVERING); | |
210 | 26 recovery(); | |
211 | 27 WRITE_ONCE(state, STATE_WANT_NORMAL); | |
212 | 28 synchronize_rcu(); | |
213 | 29 WRITE_ONCE(state, STATE_NORMAL); | |
214 | 30 } | |
215 | </pre> | |
216 | </blockquote> | |
217 | ||
218 | <p> | |
219 | The RCU read-side critical section in <tt>do_something_dlm()</tt> | |
220 | works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt> | |
221 | to guarantee that <tt>do_something()</tt> never runs concurrently | |
222 | with <tt>recovery()</tt>, but with little or no synchronization | |
223 | overhead in <tt>do_something_dlm()</tt>. | |
224 | ||
6146f8df PM |
225 | <table> |
226 | <tr><th> </th></tr> | |
227 | <tr><th align="left">Quick Quiz:</th></tr> | |
228 | <tr><td> | |
229 | Why is the <tt>synchronize_rcu()</tt> on line 28 needed? | |
230 | </td></tr> | |
231 | <tr><th align="left">Answer:</th></tr> | |
232 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
233 | Without that extra grace period, memory reordering could result in | |
234 | <tt>do_something_dlm()</tt> executing <tt>do_something()</tt> | |
235 | concurrently with the last bits of <tt>recovery()</tt>. | |
236 | </font></td></tr> | |
237 | <tr><td> </td></tr> | |
238 | </table> | |
649e4368 PM |
239 | |
240 | <p> | |
241 | In order to avoid fatal problems such as deadlocks, | |
242 | an RCU read-side critical section must not contain calls to | |
243 | <tt>synchronize_rcu()</tt>. | |
244 | Similarly, an RCU read-side critical section must not | |
245 | contain anything that waits, directly or indirectly, on completion of | |
246 | an invocation of <tt>synchronize_rcu()</tt>. | |
247 | ||
248 | <p> | |
249 | Although RCU's grace-period guarantee is useful in and of itself, with | |
250 | <a href="https://lwn.net/Articles/573497/">quite a few use cases</a>, | |
251 | it would be good to be able to use RCU to coordinate read-side | |
252 | access to linked data structures. | |
253 | For this, the grace-period guarantee is not sufficient, as can | |
254 | be seen in function <tt>add_gp_buggy()</tt> below. | |
255 | We will look at the reader's code later, but in the meantime, just think of | |
256 | the reader as locklessly picking up the <tt>gp</tt> pointer, | |
257 | and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the | |
258 | <tt>->a</tt> and <tt>->b</tt> fields. | |
259 | ||
260 | <blockquote> | |
261 | <pre> | |
262 | 1 bool add_gp_buggy(int a, int b) | |
263 | 2 { | |
264 | 3 p = kmalloc(sizeof(*p), GFP_KERNEL); | |
265 | 4 if (!p) | |
266 | 5 return -ENOMEM; | |
267 | 6 spin_lock(&gp_lock); | |
268 | 7 if (rcu_access_pointer(gp)) { | |
269 | 8 spin_unlock(&gp_lock); | |
270 | 9 return false; | |
271 | 10 } | |
272 | 11 p->a = a; | |
273 | 12 p->b = a; | |
274 | 13 gp = p; /* ORDERING BUG */ | |
275 | 14 spin_unlock(&gp_lock); | |
276 | 15 return true; | |
277 | 16 } | |
278 | </pre> | |
279 | </blockquote> | |
280 | ||
281 | <p> | |
282 | The problem is that both the compiler and weakly ordered CPUs are within | |
283 | their rights to reorder this code as follows: | |
284 | ||
285 | <blockquote> | |
286 | <pre> | |
287 | 1 bool add_gp_buggy_optimized(int a, int b) | |
288 | 2 { | |
289 | 3 p = kmalloc(sizeof(*p), GFP_KERNEL); | |
290 | 4 if (!p) | |
291 | 5 return -ENOMEM; | |
292 | 6 spin_lock(&gp_lock); | |
293 | 7 if (rcu_access_pointer(gp)) { | |
294 | 8 spin_unlock(&gp_lock); | |
295 | 9 return false; | |
296 | 10 } | |
297 | <b>11 gp = p; /* ORDERING BUG */ | |
298 | 12 p->a = a; | |
299 | 13 p->b = a;</b> | |
300 | 14 spin_unlock(&gp_lock); | |
301 | 15 return true; | |
302 | 16 } | |
303 | </pre> | |
304 | </blockquote> | |
305 | ||
306 | <p> | |
307 | If an RCU reader fetches <tt>gp</tt> just after | |
308 | <tt>add_gp_buggy_optimized</tt> executes line 11, | |
309 | it will see garbage in the <tt>->a</tt> and <tt>->b</tt> | |
310 | fields. | |
311 | And this is but one of many ways in which compiler and hardware optimizations | |
312 | could cause trouble. | |
313 | Therefore, we clearly need some way to prevent the compiler and the CPU from | |
314 | reordering in this manner, which brings us to the publish-subscribe | |
315 | guarantee discussed in the next section. | |
316 | ||
317 | <h3><a name="Publish-Subscribe Guarantee">Publish/Subscribe Guarantee</a></h3> | |
318 | ||
319 | <p> | |
320 | RCU's publish-subscribe guarantee allows data to be inserted | |
321 | into a linked data structure without disrupting RCU readers. | |
322 | The updater uses <tt>rcu_assign_pointer()</tt> to insert the | |
323 | new data, and readers use <tt>rcu_dereference()</tt> to | |
324 | access data, whether new or old. | |
325 | The following shows an example of insertion: | |
326 | ||
327 | <blockquote> | |
328 | <pre> | |
329 | 1 bool add_gp(int a, int b) | |
330 | 2 { | |
331 | 3 p = kmalloc(sizeof(*p), GFP_KERNEL); | |
332 | 4 if (!p) | |
333 | 5 return -ENOMEM; | |
334 | 6 spin_lock(&gp_lock); | |
335 | 7 if (rcu_access_pointer(gp)) { | |
336 | 8 spin_unlock(&gp_lock); | |
337 | 9 return false; | |
338 | 10 } | |
339 | 11 p->a = a; | |
340 | 12 p->b = a; | |
341 | 13 rcu_assign_pointer(gp, p); | |
342 | 14 spin_unlock(&gp_lock); | |
343 | 15 return true; | |
344 | 16 } | |
345 | </pre> | |
346 | </blockquote> | |
347 | ||
348 | <p> | |
349 | The <tt>rcu_assign_pointer()</tt> on line 13 is conceptually | |
350 | equivalent to a simple assignment statement, but also guarantees | |
351 | that its assignment will | |
352 | happen after the two assignments in lines 11 and 12, | |
353 | similar to the C11 <tt>memory_order_release</tt> store operation. | |
354 | It also prevents any number of “interesting” compiler | |
355 | optimizations, for example, the use of <tt>gp</tt> as a scratch | |
356 | location immediately preceding the assignment. | |
357 | ||
6146f8df PM |
358 | <table> |
359 | <tr><th> </th></tr> | |
360 | <tr><th align="left">Quick Quiz:</th></tr> | |
361 | <tr><td> | |
362 | But <tt>rcu_assign_pointer()</tt> does nothing to prevent the | |
363 | two assignments to <tt>p->a</tt> and <tt>p->b</tt> | |
364 | from being reordered. | |
365 | Can't that also cause problems? | |
366 | </td></tr> | |
367 | <tr><th align="left">Answer:</th></tr> | |
368 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
369 | No, it cannot. | |
370 | The readers cannot see either of these two fields until | |
371 | the assignment to <tt>gp</tt>, by which time both fields are | |
372 | fully initialized. | |
373 | So reordering the assignments | |
374 | to <tt>p->a</tt> and <tt>p->b</tt> cannot possibly | |
375 | cause any problems. | |
376 | </font></td></tr> | |
377 | <tr><td> </td></tr> | |
378 | </table> | |
649e4368 PM |
379 | |
380 | <p> | |
381 | It is tempting to assume that the reader need not do anything special | |
382 | to control its accesses to the RCU-protected data, | |
383 | as shown in <tt>do_something_gp_buggy()</tt> below: | |
384 | ||
385 | <blockquote> | |
386 | <pre> | |
387 | 1 bool do_something_gp_buggy(void) | |
388 | 2 { | |
389 | 3 rcu_read_lock(); | |
390 | 4 p = gp; /* OPTIMIZATIONS GALORE!!! */ | |
391 | 5 if (p) { | |
392 | 6 do_something(p->a, p->b); | |
393 | 7 rcu_read_unlock(); | |
394 | 8 return true; | |
395 | 9 } | |
396 | 10 rcu_read_unlock(); | |
397 | 11 return false; | |
398 | 12 } | |
399 | </pre> | |
400 | </blockquote> | |
401 | ||
402 | <p> | |
403 | However, this temptation must be resisted because there are a | |
404 | surprisingly large number of ways that the compiler | |
405 | (to say nothing of | |
406 | <a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>) | |
407 | can trip this code up. | |
408 | For but one example, if the compiler were short of registers, it | |
409 | might choose to refetch from <tt>gp</tt> rather than keeping | |
410 | a separate copy in <tt>p</tt> as follows: | |
411 | ||
412 | <blockquote> | |
413 | <pre> | |
414 | 1 bool do_something_gp_buggy_optimized(void) | |
415 | 2 { | |
416 | 3 rcu_read_lock(); | |
417 | 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ | |
418 | <b> 5 do_something(gp->a, gp->b);</b> | |
419 | 6 rcu_read_unlock(); | |
420 | 7 return true; | |
421 | 8 } | |
422 | 9 rcu_read_unlock(); | |
423 | 10 return false; | |
424 | 11 } | |
425 | </pre> | |
426 | </blockquote> | |
427 | ||
428 | <p> | |
429 | If this function ran concurrently with a series of updates that | |
430 | replaced the current structure with a new one, | |
431 | the fetches of <tt>gp->a</tt> | |
432 | and <tt>gp->b</tt> might well come from two different structures, | |
433 | which could cause serious confusion. | |
434 | To prevent this (and much else besides), <tt>do_something_gp()</tt> uses | |
435 | <tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>: | |
436 | ||
437 | <blockquote> | |
438 | <pre> | |
439 | 1 bool do_something_gp(void) | |
440 | 2 { | |
441 | 3 rcu_read_lock(); | |
442 | 4 p = rcu_dereference(gp); | |
443 | 5 if (p) { | |
444 | 6 do_something(p->a, p->b); | |
445 | 7 rcu_read_unlock(); | |
446 | 8 return true; | |
447 | 9 } | |
448 | 10 rcu_read_unlock(); | |
449 | 11 return false; | |
450 | 12 } | |
451 | </pre> | |
452 | </blockquote> | |
453 | ||
454 | <p> | |
455 | The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha) | |
456 | memory barriers in the Linux kernel. | |
457 | Should a | |
458 | <a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a> | |
459 | ever appear, then <tt>rcu_dereference()</tt> could be implemented | |
460 | as a <tt>memory_order_consume</tt> load. | |
461 | Regardless of the exact implementation, a pointer fetched by | |
462 | <tt>rcu_dereference()</tt> may not be used outside of the | |
463 | outermost RCU read-side critical section containing that | |
464 | <tt>rcu_dereference()</tt>, unless protection of | |
465 | the corresponding data element has been passed from RCU to some | |
466 | other synchronization mechanism, most commonly locking or | |
467 | <a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>. | |
468 | ||
469 | <p> | |
470 | In short, updaters use <tt>rcu_assign_pointer()</tt> and readers | |
471 | use <tt>rcu_dereference()</tt>, and these two RCU API elements | |
472 | work together to ensure that readers have a consistent view of | |
473 | newly added data elements. | |
474 | ||
475 | <p> | |
476 | Of course, it is also necessary to remove elements from RCU-protected | |
477 | data structures, for example, using the following process: | |
478 | ||
479 | <ol> | |
480 | <li> Remove the data element from the enclosing structure. | |
481 | <li> Wait for all pre-existing RCU read-side critical sections | |
482 | to complete (because only pre-existing readers can possibly have | |
483 | a reference to the newly removed data element). | |
484 | <li> At this point, only the updater has a reference to the | |
485 | newly removed data element, so it can safely reclaim | |
486 | the data element, for example, by passing it to <tt>kfree()</tt>. | |
487 | </ol> | |
488 | ||
489 | This process is implemented by <tt>remove_gp_synchronous()</tt>: | |
490 | ||
491 | <blockquote> | |
492 | <pre> | |
493 | 1 bool remove_gp_synchronous(void) | |
494 | 2 { | |
495 | 3 struct foo *p; | |
496 | 4 | |
497 | 5 spin_lock(&gp_lock); | |
498 | 6 p = rcu_access_pointer(gp); | |
499 | 7 if (!p) { | |
500 | 8 spin_unlock(&gp_lock); | |
501 | 9 return false; | |
502 | 10 } | |
503 | 11 rcu_assign_pointer(gp, NULL); | |
504 | 12 spin_unlock(&gp_lock); | |
505 | 13 synchronize_rcu(); | |
506 | 14 kfree(p); | |
507 | 15 return true; | |
508 | 16 } | |
509 | </pre> | |
510 | </blockquote> | |
511 | ||
512 | <p> | |
513 | This function is straightforward, with line 13 waiting for a grace | |
514 | period before line 14 frees the old data element. | |
515 | This waiting ensures that readers will reach line 7 of | |
516 | <tt>do_something_gp()</tt> before the data element referenced by | |
517 | <tt>p</tt> is freed. | |
518 | The <tt>rcu_access_pointer()</tt> on line 6 is similar to | |
519 | <tt>rcu_dereference()</tt>, except that: | |
520 | ||
521 | <ol> | |
522 | <li> The value returned by <tt>rcu_access_pointer()</tt> | |
523 | cannot be dereferenced. | |
524 | If you want to access the value pointed to as well as | |
525 | the pointer itself, use <tt>rcu_dereference()</tt> | |
526 | instead of <tt>rcu_access_pointer()</tt>. | |
527 | <li> The call to <tt>rcu_access_pointer()</tt> need not be | |
528 | protected. | |
529 | In contrast, <tt>rcu_dereference()</tt> must either be | |
530 | within an RCU read-side critical section or in a code | |
531 | segment where the pointer cannot change, for example, in | |
532 | code protected by the corresponding update-side lock. | |
533 | </ol> | |
534 | ||
6146f8df PM |
535 | <table> |
536 | <tr><th> </th></tr> | |
537 | <tr><th align="left">Quick Quiz:</th></tr> | |
538 | <tr><td> | |
539 | Without the <tt>rcu_dereference()</tt> or the | |
540 | <tt>rcu_access_pointer()</tt>, what destructive optimizations | |
541 | might the compiler make use of? | |
542 | </td></tr> | |
543 | <tr><th align="left">Answer:</th></tr> | |
544 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
545 | Let's start with what happens to <tt>do_something_gp()</tt> | |
546 | if it fails to use <tt>rcu_dereference()</tt>. | |
547 | It could reuse a value formerly fetched from this same pointer. | |
548 | It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time | |
549 | manner, resulting in <i>load tearing</i>, in turn resulting a bytewise | |
e2c85cb1 | 550 | mash-up of two distinct pointer values. |
6146f8df PM |
551 | It might even use value-speculation optimizations, where it makes |
552 | a wrong guess, but by the time it gets around to checking the | |
553 | value, an update has changed the pointer to match the wrong guess. | |
554 | Too bad about any dereferences that returned pre-initialization garbage | |
555 | in the meantime! | |
556 | </font> | |
557 | ||
558 | <p><font color="ffffff"> | |
559 | For <tt>remove_gp_synchronous()</tt>, as long as all modifications | |
560 | to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>, | |
561 | the above optimizations are harmless. | |
41a2901e | 562 | However, <tt>sparse</tt> will complain if you |
6146f8df PM |
563 | define <tt>gp</tt> with <tt>__rcu</tt> and then |
564 | access it without using | |
565 | either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>. | |
566 | </font></td></tr> | |
567 | <tr><td> </td></tr> | |
568 | </table> | |
649e4368 PM |
569 | |
570 | <p> | |
4b689330 PM |
571 | In short, RCU's publish-subscribe guarantee is provided by the combination |
572 | of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>. | |
573 | This guarantee allows data elements to be safely added to RCU-protected | |
574 | linked data structures without disrupting RCU readers. | |
575 | This guarantee can be used in combination with the grace-period | |
576 | guarantee to also allow data elements to be removed from RCU-protected | |
577 | linked data structures, again without disrupting RCU readers. | |
578 | ||
579 | <p> | |
580 | This guarantee was only partially premeditated. | |
581 | DYNIX/ptx used an explicit memory barrier for publication, but had nothing | |
582 | resembling <tt>rcu_dereference()</tt> for subscription, nor did it | |
583 | have anything resembling the <tt>smp_read_barrier_depends()</tt> | |
9ad3c143 PM |
584 | that was later subsumed into <tt>rcu_dereference()</tt> and later |
585 | still into <tt>READ_ONCE()</tt>. | |
4b689330 PM |
586 | The need for these operations made itself known quite suddenly at a |
587 | late-1990s meeting with the DEC Alpha architects, back in the days when | |
588 | DEC was still a free-standing company. | |
589 | It took the Alpha architects a good hour to convince me that any sort | |
590 | of barrier would ever be needed, and it then took me a good <i>two</i> hours | |
591 | to convince them that their documentation did not make this point clear. | |
592 | More recent work with the C and C++ standards committees have provided | |
593 | much education on tricks and traps from the compiler. | |
594 | In short, compilers were much less tricky in the early 1990s, but in | |
595 | 2015, don't even think about omitting <tt>rcu_dereference()</tt>! | |
596 | ||
597 | <h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3> | |
598 | ||
599 | <p> | |
600 | The previous section's simple linked-data-structure scenario clearly | |
601 | demonstrates the need for RCU's stringent memory-ordering guarantees on | |
602 | systems with more than one CPU: | |
649e4368 PM |
603 | |
604 | <ol> | |
605 | <li> Each CPU that has an RCU read-side critical section that | |
606 | begins before <tt>synchronize_rcu()</tt> starts is | |
607 | guaranteed to execute a full memory barrier between the time | |
608 | that the RCU read-side critical section ends and the time that | |
609 | <tt>synchronize_rcu()</tt> returns. | |
610 | Without this guarantee, a pre-existing RCU read-side critical section | |
611 | might hold a reference to the newly removed <tt>struct foo</tt> | |
612 | after the <tt>kfree()</tt> on line 14 of | |
613 | <tt>remove_gp_synchronous()</tt>. | |
614 | <li> Each CPU that has an RCU read-side critical section that ends | |
615 | after <tt>synchronize_rcu()</tt> returns is guaranteed | |
616 | to execute a full memory barrier between the time that | |
617 | <tt>synchronize_rcu()</tt> begins and the time that the RCU | |
618 | read-side critical section begins. | |
619 | Without this guarantee, a later RCU read-side critical section | |
620 | running after the <tt>kfree()</tt> on line 14 of | |
621 | <tt>remove_gp_synchronous()</tt> might | |
622 | later run <tt>do_something_gp()</tt> and find the | |
623 | newly deleted <tt>struct foo</tt>. | |
624 | <li> If the task invoking <tt>synchronize_rcu()</tt> remains | |
625 | on a given CPU, then that CPU is guaranteed to execute a full | |
626 | memory barrier sometime during the execution of | |
627 | <tt>synchronize_rcu()</tt>. | |
628 | This guarantee ensures that the <tt>kfree()</tt> on | |
629 | line 14 of <tt>remove_gp_synchronous()</tt> really does | |
630 | execute after the removal on line 11. | |
631 | <li> If the task invoking <tt>synchronize_rcu()</tt> migrates | |
632 | among a group of CPUs during that invocation, then each of the | |
633 | CPUs in that group is guaranteed to execute a full memory barrier | |
634 | sometime during the execution of <tt>synchronize_rcu()</tt>. | |
635 | This guarantee also ensures that the <tt>kfree()</tt> on | |
636 | line 14 of <tt>remove_gp_synchronous()</tt> really does | |
637 | execute after the removal on | |
638 | line 11, but also in the case where the thread executing the | |
639 | <tt>synchronize_rcu()</tt> migrates in the meantime. | |
640 | </ol> | |
641 | ||
6146f8df PM |
642 | <table> |
643 | <tr><th> </th></tr> | |
644 | <tr><th align="left">Quick Quiz:</th></tr> | |
645 | <tr><td> | |
646 | Given that multiple CPUs can start RCU read-side critical sections | |
647 | at any time without any ordering whatsoever, how can RCU possibly | |
648 | tell whether or not a given RCU read-side critical section starts | |
649 | before a given instance of <tt>synchronize_rcu()</tt>? | |
650 | </td></tr> | |
651 | <tr><th align="left">Answer:</th></tr> | |
652 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
653 | If RCU cannot tell whether or not a given | |
654 | RCU read-side critical section starts before a | |
655 | given instance of <tt>synchronize_rcu()</tt>, | |
656 | then it must assume that the RCU read-side critical section | |
657 | started first. | |
658 | In other words, a given instance of <tt>synchronize_rcu()</tt> | |
659 | can avoid waiting on a given RCU read-side critical section only | |
660 | if it can prove that <tt>synchronize_rcu()</tt> started first. | |
6771853b | 661 | </font> |
e2c85cb1 | 662 | |
6771853b | 663 | <p><font color="ffffff"> |
e2c85cb1 PM |
664 | A related question is “When <tt>rcu_read_lock()</tt> |
665 | doesn't generate any code, why does it matter how it relates | |
666 | to a grace period?” | |
667 | The answer is that it is not the relationship of | |
668 | <tt>rcu_read_lock()</tt> itself that is important, but rather | |
669 | the relationship of the code within the enclosed RCU read-side | |
670 | critical section to the code preceding and following the | |
671 | grace period. | |
672 | If we take this viewpoint, then a given RCU read-side critical | |
673 | section begins before a given grace period when some access | |
674 | preceding the grace period observes the effect of some access | |
675 | within the critical section, in which case none of the accesses | |
676 | within the critical section may observe the effects of any | |
677 | access following the grace period. | |
6771853b | 678 | </font> |
e2c85cb1 | 679 | |
6771853b | 680 | <p><font color="ffffff"> |
e2c85cb1 PM |
681 | As of late 2016, mathematical models of RCU take this |
682 | viewpoint, for example, see slides 62 and 63 | |
683 | of the | |
684 | <a href="http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.2016.10.04c.LCE.pdf">2016 LinuxCon EU</a> | |
685 | presentation. | |
6146f8df PM |
686 | </font></td></tr> |
687 | <tr><td> </td></tr> | |
688 | </table> | |
689 | ||
690 | <table> | |
691 | <tr><th> </th></tr> | |
692 | <tr><th align="left">Quick Quiz:</th></tr> | |
693 | <tr><td> | |
694 | The first and second guarantees require unbelievably strict ordering! | |
695 | Are all these memory barriers <i> really</i> required? | |
696 | </td></tr> | |
697 | <tr><th align="left">Answer:</th></tr> | |
698 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
699 | Yes, they really are required. | |
700 | To see why the first guarantee is required, consider the following | |
701 | sequence of events: | |
702 | </font> | |
703 | ||
704 | <ol> | |
705 | <li> <font color="ffffff"> | |
706 | CPU 1: <tt>rcu_read_lock()</tt> | |
707 | </font> | |
708 | <li> <font color="ffffff"> | |
709 | CPU 1: <tt>q = rcu_dereference(gp); | |
710 | /* Very likely to return p. */</tt> | |
711 | </font> | |
712 | <li> <font color="ffffff"> | |
713 | CPU 0: <tt>list_del_rcu(p);</tt> | |
714 | </font> | |
715 | <li> <font color="ffffff"> | |
716 | CPU 0: <tt>synchronize_rcu()</tt> starts. | |
717 | </font> | |
718 | <li> <font color="ffffff"> | |
719 | CPU 1: <tt>do_something_with(q->a); | |
720 | /* No smp_mb(), so might happen after kfree(). */</tt> | |
721 | </font> | |
722 | <li> <font color="ffffff"> | |
723 | CPU 1: <tt>rcu_read_unlock()</tt> | |
724 | </font> | |
725 | <li> <font color="ffffff"> | |
726 | CPU 0: <tt>synchronize_rcu()</tt> returns. | |
727 | </font> | |
728 | <li> <font color="ffffff"> | |
729 | CPU 0: <tt>kfree(p);</tt> | |
730 | </font> | |
731 | </ol> | |
732 | ||
733 | <p><font color="ffffff"> | |
734 | Therefore, there absolutely must be a full memory barrier between the | |
735 | end of the RCU read-side critical section and the end of the | |
736 | grace period. | |
737 | </font> | |
738 | ||
739 | <p><font color="ffffff"> | |
740 | The sequence of events demonstrating the necessity of the second rule | |
741 | is roughly similar: | |
742 | </font> | |
743 | ||
744 | <ol> | |
745 | <li> <font color="ffffff">CPU 0: <tt>list_del_rcu(p);</tt> | |
746 | </font> | |
747 | <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> starts. | |
748 | </font> | |
749 | <li> <font color="ffffff">CPU 1: <tt>rcu_read_lock()</tt> | |
750 | </font> | |
751 | <li> <font color="ffffff">CPU 1: <tt>q = rcu_dereference(gp); | |
752 | /* Might return p if no memory barrier. */</tt> | |
753 | </font> | |
754 | <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> returns. | |
755 | </font> | |
756 | <li> <font color="ffffff">CPU 0: <tt>kfree(p);</tt> | |
757 | </font> | |
758 | <li> <font color="ffffff"> | |
759 | CPU 1: <tt>do_something_with(q->a); /* Boom!!! */</tt> | |
760 | </font> | |
761 | <li> <font color="ffffff">CPU 1: <tt>rcu_read_unlock()</tt> | |
762 | </font> | |
763 | </ol> | |
764 | ||
765 | <p><font color="ffffff"> | |
766 | And similarly, without a memory barrier between the beginning of the | |
767 | grace period and the beginning of the RCU read-side critical section, | |
768 | CPU 1 might end up accessing the freelist. | |
769 | </font> | |
770 | ||
771 | <p><font color="ffffff"> | |
772 | The “as if” rule of course applies, so that any | |
773 | implementation that acts as if the appropriate memory barriers | |
774 | were in place is a correct implementation. | |
775 | That said, it is much easier to fool yourself into believing | |
776 | that you have adhered to the as-if rule than it is to actually | |
777 | adhere to it! | |
778 | </font></td></tr> | |
779 | <tr><td> </td></tr> | |
780 | </table> | |
781 | ||
782 | <table> | |
783 | <tr><th> </th></tr> | |
784 | <tr><th align="left">Quick Quiz:</th></tr> | |
785 | <tr><td> | |
786 | You claim that <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> | |
787 | generate absolutely no code in some kernel builds. | |
788 | This means that the compiler might arbitrarily rearrange consecutive | |
789 | RCU read-side critical sections. | |
790 | Given such rearrangement, if a given RCU read-side critical section | |
791 | is done, how can you be sure that all prior RCU read-side critical | |
792 | sections are done? | |
793 | Won't the compiler rearrangements make that impossible to determine? | |
794 | </td></tr> | |
795 | <tr><th align="left">Answer:</th></tr> | |
796 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
797 | In cases where <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> | |
798 | generate absolutely no code, RCU infers quiescent states only at | |
799 | special locations, for example, within the scheduler. | |
800 | Because calls to <tt>schedule()</tt> had better prevent calling-code | |
801 | accesses to shared variables from being rearranged across the call to | |
802 | <tt>schedule()</tt>, if RCU detects the end of a given RCU read-side | |
803 | critical section, it will necessarily detect the end of all prior | |
804 | RCU read-side critical sections, no matter how aggressively the | |
805 | compiler scrambles the code. | |
806 | </font> | |
807 | ||
808 | <p><font color="ffffff"> | |
809 | Again, this all assumes that the compiler cannot scramble code across | |
810 | calls to the scheduler, out of interrupt handlers, into the idle loop, | |
811 | into user-mode code, and so on. | |
812 | But if your kernel build allows that sort of scrambling, you have broken | |
813 | far more than just RCU! | |
814 | </font></td></tr> | |
815 | <tr><td> </td></tr> | |
816 | </table> | |
d8936c0b | 817 | |
649e4368 | 818 | <p> |
4b689330 PM |
819 | Note that these memory-barrier requirements do not replace the fundamental |
820 | RCU requirement that a grace period wait for all pre-existing readers. | |
821 | On the contrary, the memory barriers called out in this section must operate in | |
822 | such a way as to <i>enforce</i> this fundamental requirement. | |
823 | Of course, different implementations enforce this requirement in different | |
824 | ways, but enforce it they must. | |
649e4368 PM |
825 | |
826 | <h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3> | |
827 | ||
828 | <p> | |
829 | The common-case RCU primitives are unconditional. | |
830 | They are invoked, they do their job, and they return, with no possibility | |
831 | of error, and no need to retry. | |
832 | This is a key RCU design philosophy. | |
833 | ||
834 | <p> | |
835 | However, this philosophy is pragmatic rather than pigheaded. | |
836 | If someone comes up with a good justification for a particular conditional | |
837 | RCU primitive, it might well be implemented and added. | |
838 | After all, this guarantee was reverse-engineered, not premeditated. | |
839 | The unconditional nature of the RCU primitives was initially an | |
840 | accident of implementation, and later experience with synchronization | |
841 | primitives with conditional primitives caused me to elevate this | |
842 | accident to a guarantee. | |
843 | Therefore, the justification for adding a conditional primitive to | |
844 | RCU would need to be based on detailed and compelling use cases. | |
845 | ||
846 | <h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3> | |
847 | ||
848 | <p> | |
849 | As far as RCU is concerned, it is always possible to carry out an | |
850 | update within an RCU read-side critical section. | |
851 | For example, that RCU read-side critical section might search for | |
852 | a given data element, and then might acquire the update-side | |
853 | spinlock in order to update that element, all while remaining | |
854 | in that RCU read-side critical section. | |
855 | Of course, it is necessary to exit the RCU read-side critical section | |
856 | before invoking <tt>synchronize_rcu()</tt>, however, this | |
857 | inconvenience can be avoided through use of the | |
858 | <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members | |
859 | described later in this document. | |
860 | ||
6146f8df PM |
861 | <table> |
862 | <tr><th> </th></tr> | |
863 | <tr><th align="left">Quick Quiz:</th></tr> | |
864 | <tr><td> | |
865 | But how does the upgrade-to-write operation exclude other readers? | |
866 | </td></tr> | |
867 | <tr><th align="left">Answer:</th></tr> | |
868 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
869 | It doesn't, just like normal RCU updates, which also do not exclude | |
870 | RCU readers. | |
871 | </font></td></tr> | |
872 | <tr><td> </td></tr> | |
873 | </table> | |
649e4368 PM |
874 | |
875 | <p> | |
876 | This guarantee allows lookup code to be shared between read-side | |
877 | and update-side code, and was premeditated, appearing in the earliest | |
878 | DYNIX/ptx RCU documentation. | |
879 | ||
880 | <h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2> | |
881 | ||
882 | <p> | |
883 | RCU provides extremely lightweight readers, and its read-side guarantees, | |
884 | though quite useful, are correspondingly lightweight. | |
885 | It is therefore all too easy to assume that RCU is guaranteeing more | |
886 | than it really is. | |
887 | Of course, the list of things that RCU does not guarantee is infinitely | |
888 | long, however, the following sections list a few non-guarantees that | |
889 | have caused confusion. | |
890 | Except where otherwise noted, these non-guarantees were premeditated. | |
891 | ||
892 | <ol> | |
893 | <li> <a href="#Readers Impose Minimal Ordering"> | |
894 | Readers Impose Minimal Ordering</a> | |
895 | <li> <a href="#Readers Do Not Exclude Updaters"> | |
896 | Readers Do Not Exclude Updaters</a> | |
897 | <li> <a href="#Updaters Only Wait For Old Readers"> | |
898 | Updaters Only Wait For Old Readers</a> | |
899 | <li> <a href="#Grace Periods Don't Partition Read-Side Critical Sections"> | |
900 | Grace Periods Don't Partition Read-Side Critical Sections</a> | |
901 | <li> <a href="#Read-Side Critical Sections Don't Partition Grace Periods"> | |
902 | Read-Side Critical Sections Don't Partition Grace Periods</a> | |
903 | <li> <a href="#Disabling Preemption Does Not Block Grace Periods"> | |
904 | Disabling Preemption Does Not Block Grace Periods</a> | |
905 | </ol> | |
906 | ||
907 | <h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3> | |
908 | ||
909 | <p> | |
910 | Reader-side markers such as <tt>rcu_read_lock()</tt> and | |
911 | <tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees | |
912 | except through their interaction with the grace-period APIs such as | |
913 | <tt>synchronize_rcu()</tt>. | |
914 | To see this, consider the following pair of threads: | |
915 | ||
916 | <blockquote> | |
917 | <pre> | |
918 | 1 void thread0(void) | |
919 | 2 { | |
920 | 3 rcu_read_lock(); | |
921 | 4 WRITE_ONCE(x, 1); | |
922 | 5 rcu_read_unlock(); | |
923 | 6 rcu_read_lock(); | |
924 | 7 WRITE_ONCE(y, 1); | |
925 | 8 rcu_read_unlock(); | |
926 | 9 } | |
927 | 10 | |
928 | 11 void thread1(void) | |
929 | 12 { | |
930 | 13 rcu_read_lock(); | |
931 | 14 r1 = READ_ONCE(y); | |
932 | 15 rcu_read_unlock(); | |
933 | 16 rcu_read_lock(); | |
934 | 17 r2 = READ_ONCE(x); | |
935 | 18 rcu_read_unlock(); | |
936 | 19 } | |
937 | </pre> | |
938 | </blockquote> | |
939 | ||
940 | <p> | |
941 | After <tt>thread0()</tt> and <tt>thread1()</tt> execute | |
942 | concurrently, it is quite possible to have | |
943 | ||
944 | <blockquote> | |
945 | <pre> | |
946 | (r1 == 1 && r2 == 0) | |
947 | </pre> | |
948 | </blockquote> | |
949 | ||
950 | (that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>), | |
951 | which would not be possible if <tt>rcu_read_lock()</tt> and | |
952 | <tt>rcu_read_unlock()</tt> had much in the way of ordering | |
953 | properties. | |
954 | But they do not, so the CPU is within its rights | |
955 | to do significant reordering. | |
956 | This is by design: Any significant ordering constraints would slow down | |
957 | these fast-path APIs. | |
958 | ||
6146f8df PM |
959 | <table> |
960 | <tr><th> </th></tr> | |
961 | <tr><th align="left">Quick Quiz:</th></tr> | |
962 | <tr><td> | |
963 | Can't the compiler also reorder this code? | |
964 | </td></tr> | |
965 | <tr><th align="left">Answer:</th></tr> | |
966 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
967 | No, the volatile casts in <tt>READ_ONCE()</tt> and | |
968 | <tt>WRITE_ONCE()</tt> prevent the compiler from reordering in | |
969 | this particular case. | |
970 | </font></td></tr> | |
971 | <tr><td> </td></tr> | |
972 | </table> | |
649e4368 PM |
973 | |
974 | <h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3> | |
975 | ||
976 | <p> | |
977 | Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt> | |
978 | exclude updates. | |
979 | All they do is to prevent grace periods from ending. | |
980 | The following example illustrates this: | |
981 | ||
982 | <blockquote> | |
983 | <pre> | |
984 | 1 void thread0(void) | |
985 | 2 { | |
986 | 3 rcu_read_lock(); | |
987 | 4 r1 = READ_ONCE(y); | |
988 | 5 if (r1) { | |
989 | 6 do_something_with_nonzero_x(); | |
990 | 7 r2 = READ_ONCE(x); | |
991 | 8 WARN_ON(!r2); /* BUG!!! */ | |
992 | 9 } | |
993 | 10 rcu_read_unlock(); | |
994 | 11 } | |
995 | 12 | |
996 | 13 void thread1(void) | |
997 | 14 { | |
998 | 15 spin_lock(&my_lock); | |
999 | 16 WRITE_ONCE(x, 1); | |
1000 | 17 WRITE_ONCE(y, 1); | |
1001 | 18 spin_unlock(&my_lock); | |
1002 | 19 } | |
1003 | </pre> | |
1004 | </blockquote> | |
1005 | ||
1006 | <p> | |
1007 | If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt> | |
1008 | excluded the <tt>thread1()</tt> function's update, | |
1009 | the <tt>WARN_ON()</tt> could never fire. | |
1010 | But the fact is that <tt>rcu_read_lock()</tt> does not exclude | |
1011 | much of anything aside from subsequent grace periods, of which | |
1012 | <tt>thread1()</tt> has none, so the | |
1013 | <tt>WARN_ON()</tt> can and does fire. | |
1014 | ||
1015 | <h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3> | |
1016 | ||
1017 | <p> | |
1018 | It might be tempting to assume that after <tt>synchronize_rcu()</tt> | |
1019 | completes, there are no readers executing. | |
1020 | This temptation must be avoided because | |
1021 | new readers can start immediately after <tt>synchronize_rcu()</tt> | |
1022 | starts, and <tt>synchronize_rcu()</tt> is under no | |
1023 | obligation to wait for these new readers. | |
1024 | ||
6146f8df PM |
1025 | <table> |
1026 | <tr><th> </th></tr> | |
1027 | <tr><th align="left">Quick Quiz:</th></tr> | |
1028 | <tr><td> | |
5413e24c PM |
1029 | Suppose that synchronize_rcu() did wait until <i>all</i> |
1030 | readers had completed instead of waiting only on | |
1031 | pre-existing readers. | |
1032 | For how long would the updater be able to rely on there | |
1033 | being no readers? | |
6146f8df PM |
1034 | </td></tr> |
1035 | <tr><th align="left">Answer:</th></tr> | |
1036 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
5413e24c | 1037 | For no time at all. |
6146f8df PM |
1038 | Even if <tt>synchronize_rcu()</tt> were to wait until |
1039 | all readers had completed, a new reader might start immediately after | |
1040 | <tt>synchronize_rcu()</tt> completed. | |
1041 | Therefore, the code following | |
5413e24c PM |
1042 | <tt>synchronize_rcu()</tt> can <i>never</i> rely on there being |
1043 | no readers. | |
6146f8df PM |
1044 | </font></td></tr> |
1045 | <tr><td> </td></tr> | |
1046 | </table> | |
649e4368 PM |
1047 | |
1048 | <h3><a name="Grace Periods Don't Partition Read-Side Critical Sections"> | |
1049 | Grace Periods Don't Partition Read-Side Critical Sections</a></h3> | |
1050 | ||
1051 | <p> | |
1052 | It is tempting to assume that if any part of one RCU read-side critical | |
1053 | section precedes a given grace period, and if any part of another RCU | |
1054 | read-side critical section follows that same grace period, then all of | |
1055 | the first RCU read-side critical section must precede all of the second. | |
1056 | However, this just isn't the case: A single grace period does not | |
1057 | partition the set of RCU read-side critical sections. | |
1058 | An example of this situation can be illustrated as follows, where | |
1059 | <tt>x</tt>, <tt>y</tt>, and <tt>z</tt> are initially all zero: | |
1060 | ||
1061 | <blockquote> | |
1062 | <pre> | |
1063 | 1 void thread0(void) | |
1064 | 2 { | |
1065 | 3 rcu_read_lock(); | |
1066 | 4 WRITE_ONCE(a, 1); | |
1067 | 5 WRITE_ONCE(b, 1); | |
1068 | 6 rcu_read_unlock(); | |
1069 | 7 } | |
1070 | 8 | |
1071 | 9 void thread1(void) | |
1072 | 10 { | |
1073 | 11 r1 = READ_ONCE(a); | |
1074 | 12 synchronize_rcu(); | |
1075 | 13 WRITE_ONCE(c, 1); | |
1076 | 14 } | |
1077 | 15 | |
1078 | 16 void thread2(void) | |
1079 | 17 { | |
1080 | 18 rcu_read_lock(); | |
1081 | 19 r2 = READ_ONCE(b); | |
1082 | 20 r3 = READ_ONCE(c); | |
1083 | 21 rcu_read_unlock(); | |
1084 | 22 } | |
1085 | </pre> | |
1086 | </blockquote> | |
1087 | ||
1088 | <p> | |
1089 | It turns out that the outcome: | |
1090 | ||
1091 | <blockquote> | |
1092 | <pre> | |
1093 | (r1 == 1 && r2 == 0 && r3 == 1) | |
1094 | </pre> | |
1095 | </blockquote> | |
1096 | ||
1097 | is entirely possible. | |
1098 | The following figure show how this can happen, with each circled | |
1099 | <tt>QS</tt> indicating the point at which RCU recorded a | |
1100 | <i>quiescent state</i> for each thread, that is, a state in which | |
1101 | RCU knows that the thread cannot be in the midst of an RCU read-side | |
1102 | critical section that started before the current grace period: | |
1103 | ||
1104 | <p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p> | |
1105 | ||
1106 | <p> | |
1107 | If it is necessary to partition RCU read-side critical sections in this | |
1108 | manner, it is necessary to use two grace periods, where the first | |
1109 | grace period is known to end before the second grace period starts: | |
1110 | ||
1111 | <blockquote> | |
1112 | <pre> | |
1113 | 1 void thread0(void) | |
1114 | 2 { | |
1115 | 3 rcu_read_lock(); | |
1116 | 4 WRITE_ONCE(a, 1); | |
1117 | 5 WRITE_ONCE(b, 1); | |
1118 | 6 rcu_read_unlock(); | |
1119 | 7 } | |
1120 | 8 | |
1121 | 9 void thread1(void) | |
1122 | 10 { | |
1123 | 11 r1 = READ_ONCE(a); | |
1124 | 12 synchronize_rcu(); | |
1125 | 13 WRITE_ONCE(c, 1); | |
1126 | 14 } | |
1127 | 15 | |
1128 | 16 void thread2(void) | |
1129 | 17 { | |
1130 | 18 r2 = READ_ONCE(c); | |
1131 | 19 synchronize_rcu(); | |
1132 | 20 WRITE_ONCE(d, 1); | |
1133 | 21 } | |
1134 | 22 | |
1135 | 23 void thread3(void) | |
1136 | 24 { | |
1137 | 25 rcu_read_lock(); | |
1138 | 26 r3 = READ_ONCE(b); | |
1139 | 27 r4 = READ_ONCE(d); | |
1140 | 28 rcu_read_unlock(); | |
1141 | 29 } | |
1142 | </pre> | |
1143 | </blockquote> | |
1144 | ||
1145 | <p> | |
1146 | Here, if <tt>(r1 == 1)</tt>, then | |
1147 | <tt>thread0()</tt>'s write to <tt>b</tt> must happen | |
1148 | before the end of <tt>thread1()</tt>'s grace period. | |
1149 | If in addition <tt>(r4 == 1)</tt>, then | |
1150 | <tt>thread3()</tt>'s read from <tt>b</tt> must happen | |
1151 | after the beginning of <tt>thread2()</tt>'s grace period. | |
1152 | If it is also the case that <tt>(r2 == 1)</tt>, then the | |
1153 | end of <tt>thread1()</tt>'s grace period must precede the | |
1154 | beginning of <tt>thread2()</tt>'s grace period. | |
1155 | This mean that the two RCU read-side critical sections cannot overlap, | |
1156 | guaranteeing that <tt>(r3 == 1)</tt>. | |
1157 | As a result, the outcome: | |
1158 | ||
1159 | <blockquote> | |
1160 | <pre> | |
1161 | (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) | |
1162 | </pre> | |
1163 | </blockquote> | |
1164 | ||
1165 | cannot happen. | |
1166 | ||
1167 | <p> | |
1168 | This non-requirement was also non-premeditated, but became apparent | |
1169 | when studying RCU's interaction with memory ordering. | |
1170 | ||
1171 | <h3><a name="Read-Side Critical Sections Don't Partition Grace Periods"> | |
1172 | Read-Side Critical Sections Don't Partition Grace Periods</a></h3> | |
1173 | ||
1174 | <p> | |
1175 | It is also tempting to assume that if an RCU read-side critical section | |
1176 | happens between a pair of grace periods, then those grace periods cannot | |
1177 | overlap. | |
1178 | However, this temptation leads nowhere good, as can be illustrated by | |
1179 | the following, with all variables initially zero: | |
1180 | ||
1181 | <blockquote> | |
1182 | <pre> | |
1183 | 1 void thread0(void) | |
1184 | 2 { | |
1185 | 3 rcu_read_lock(); | |
1186 | 4 WRITE_ONCE(a, 1); | |
1187 | 5 WRITE_ONCE(b, 1); | |
1188 | 6 rcu_read_unlock(); | |
1189 | 7 } | |
1190 | 8 | |
1191 | 9 void thread1(void) | |
1192 | 10 { | |
1193 | 11 r1 = READ_ONCE(a); | |
1194 | 12 synchronize_rcu(); | |
1195 | 13 WRITE_ONCE(c, 1); | |
1196 | 14 } | |
1197 | 15 | |
1198 | 16 void thread2(void) | |
1199 | 17 { | |
1200 | 18 rcu_read_lock(); | |
1201 | 19 WRITE_ONCE(d, 1); | |
1202 | 20 r2 = READ_ONCE(c); | |
1203 | 21 rcu_read_unlock(); | |
1204 | 22 } | |
1205 | 23 | |
1206 | 24 void thread3(void) | |
1207 | 25 { | |
1208 | 26 r3 = READ_ONCE(d); | |
1209 | 27 synchronize_rcu(); | |
1210 | 28 WRITE_ONCE(e, 1); | |
1211 | 29 } | |
1212 | 30 | |
1213 | 31 void thread4(void) | |
1214 | 32 { | |
1215 | 33 rcu_read_lock(); | |
1216 | 34 r4 = READ_ONCE(b); | |
1217 | 35 r5 = READ_ONCE(e); | |
1218 | 36 rcu_read_unlock(); | |
1219 | 37 } | |
1220 | </pre> | |
1221 | </blockquote> | |
1222 | ||
1223 | <p> | |
1224 | In this case, the outcome: | |
1225 | ||
1226 | <blockquote> | |
1227 | <pre> | |
1228 | (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) | |
1229 | </pre> | |
1230 | </blockquote> | |
1231 | ||
1232 | is entirely possible, as illustrated below: | |
1233 | ||
1234 | <p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p> | |
1235 | ||
1236 | <p> | |
1237 | Again, an RCU read-side critical section can overlap almost all of a | |
1238 | given grace period, just so long as it does not overlap the entire | |
1239 | grace period. | |
1240 | As a result, an RCU read-side critical section cannot partition a pair | |
1241 | of RCU grace periods. | |
1242 | ||
6146f8df PM |
1243 | <table> |
1244 | <tr><th> </th></tr> | |
1245 | <tr><th align="left">Quick Quiz:</th></tr> | |
1246 | <tr><td> | |
1247 | How long a sequence of grace periods, each separated by an RCU | |
1248 | read-side critical section, would be required to partition the RCU | |
1249 | read-side critical sections at the beginning and end of the chain? | |
1250 | </td></tr> | |
1251 | <tr><th align="left">Answer:</th></tr> | |
1252 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
1253 | In theory, an infinite number. | |
1254 | In practice, an unknown number that is sensitive to both implementation | |
1255 | details and timing considerations. | |
1256 | Therefore, even in practice, RCU users must abide by the | |
1257 | theoretical rather than the practical answer. | |
1258 | </font></td></tr> | |
1259 | <tr><td> </td></tr> | |
1260 | </table> | |
649e4368 PM |
1261 | |
1262 | <h3><a name="Disabling Preemption Does Not Block Grace Periods"> | |
1263 | Disabling Preemption Does Not Block Grace Periods</a></h3> | |
1264 | ||
1265 | <p> | |
1266 | There was a time when disabling preemption on any given CPU would block | |
1267 | subsequent grace periods. | |
1268 | However, this was an accident of implementation and is not a requirement. | |
1269 | And in the current Linux-kernel implementation, disabling preemption | |
1270 | on a given CPU in fact does not block grace periods, as Oleg Nesterov | |
1271 | <a href="https://lkml.kernel.org/g/20150614193825.GA19582@redhat.com">demonstrated</a>. | |
1272 | ||
1273 | <p> | |
1274 | If you need a preempt-disable region to block grace periods, you need to add | |
1275 | <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>, for example | |
1276 | as follows: | |
1277 | ||
1278 | <blockquote> | |
1279 | <pre> | |
1280 | 1 preempt_disable(); | |
1281 | 2 rcu_read_lock(); | |
1282 | 3 do_something(); | |
1283 | 4 rcu_read_unlock(); | |
1284 | 5 preempt_enable(); | |
1285 | 6 | |
1286 | 7 /* Spinlocks implicitly disable preemption. */ | |
1287 | 8 spin_lock(&mylock); | |
1288 | 9 rcu_read_lock(); | |
1289 | 10 do_something(); | |
1290 | 11 rcu_read_unlock(); | |
1291 | 12 spin_unlock(&mylock); | |
1292 | </pre> | |
1293 | </blockquote> | |
1294 | ||
1295 | <p> | |
1296 | In theory, you could enter the RCU read-side critical section first, | |
1297 | but it is more efficient to keep the entire RCU read-side critical | |
1298 | section contained in the preempt-disable region as shown above. | |
1299 | Of course, RCU read-side critical sections that extend outside of | |
1300 | preempt-disable regions will work correctly, but such critical sections | |
1301 | can be preempted, which forces <tt>rcu_read_unlock()</tt> to do | |
1302 | more work. | |
1303 | And no, this is <i>not</i> an invitation to enclose all of your RCU | |
1304 | read-side critical sections within preempt-disable regions, because | |
1305 | doing so would degrade real-time response. | |
1306 | ||
1307 | <p> | |
1308 | This non-requirement appeared with preemptible RCU. | |
649e4368 PM |
1309 | |
1310 | <h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2> | |
1311 | ||
1312 | <p> | |
1313 | These parallelism facts of life are by no means specific to RCU, but | |
1314 | the RCU implementation must abide by them. | |
1315 | They therefore bear repeating: | |
1316 | ||
1317 | <ol> | |
1318 | <li> Any CPU or task may be delayed at any time, | |
1319 | and any attempts to avoid these delays by disabling | |
1320 | preemption, interrupts, or whatever are completely futile. | |
1321 | This is most obvious in preemptible user-level | |
1322 | environments and in virtualized environments (where | |
1323 | a given guest OS's VCPUs can be preempted at any time by | |
1324 | the underlying hypervisor), but can also happen in bare-metal | |
1325 | environments due to ECC errors, NMIs, and other hardware | |
1326 | events. | |
1327 | Although a delay of more than about 20 seconds can result | |
1328 | in splats, the RCU implementation is obligated to use | |
1329 | algorithms that can tolerate extremely long delays, but where | |
1330 | “extremely long” is not long enough to allow | |
1331 | wrap-around when incrementing a 64-bit counter. | |
1332 | <li> Both the compiler and the CPU can reorder memory accesses. | |
1333 | Where it matters, RCU must use compiler directives and | |
1334 | memory-barrier instructions to preserve ordering. | |
1335 | <li> Conflicting writes to memory locations in any given cache line | |
1336 | will result in expensive cache misses. | |
1337 | Greater numbers of concurrent writes and more-frequent | |
1338 | concurrent writes will result in more dramatic slowdowns. | |
1339 | RCU is therefore obligated to use algorithms that have | |
1340 | sufficient locality to avoid significant performance and | |
1341 | scalability problems. | |
1342 | <li> As a rough rule of thumb, only one CPU's worth of processing | |
1343 | may be carried out under the protection of any given exclusive | |
1344 | lock. | |
1345 | RCU must therefore use scalable locking designs. | |
1346 | <li> Counters are finite, especially on 32-bit systems. | |
1347 | RCU's use of counters must therefore tolerate counter wrap, | |
1348 | or be designed such that counter wrap would take way more | |
1349 | time than a single system is likely to run. | |
1350 | An uptime of ten years is quite possible, a runtime | |
1351 | of a century much less so. | |
1352 | As an example of the latter, RCU's dyntick-idle nesting counter | |
1353 | allows 54 bits for interrupt nesting level (this counter | |
1354 | is 64 bits even on a 32-bit system). | |
1355 | Overflowing this counter requires 2<sup>54</sup> | |
1356 | half-interrupts on a given CPU without that CPU ever going idle. | |
1357 | If a half-interrupt happened every microsecond, it would take | |
1358 | 570 years of runtime to overflow this counter, which is currently | |
1359 | believed to be an acceptably long time. | |
1360 | <li> Linux systems can have thousands of CPUs running a single | |
1361 | Linux kernel in a single shared-memory environment. | |
1362 | RCU must therefore pay close attention to high-end scalability. | |
1363 | </ol> | |
1364 | ||
1365 | <p> | |
1366 | This last parallelism fact of life means that RCU must pay special | |
1367 | attention to the preceding facts of life. | |
1368 | The idea that Linux might scale to systems with thousands of CPUs would | |
1369 | have been met with some skepticism in the 1990s, but these requirements | |
1370 | would have otherwise have been unsurprising, even in the early 1990s. | |
1371 | ||
1372 | <h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2> | |
1373 | ||
1374 | <p> | |
1375 | These sections list quality-of-implementation requirements. | |
1376 | Although an RCU implementation that ignores these requirements could | |
1377 | still be used, it would likely be subject to limitations that would | |
1378 | make it inappropriate for industrial-strength production use. | |
1379 | Classes of quality-of-implementation requirements are as follows: | |
1380 | ||
1381 | <ol> | |
1382 | <li> <a href="#Specialization">Specialization</a> | |
1383 | <li> <a href="#Performance and Scalability">Performance and Scalability</a> | |
1384 | <li> <a href="#Composability">Composability</a> | |
1385 | <li> <a href="#Corner Cases">Corner Cases</a> | |
1386 | </ol> | |
1387 | ||
1388 | <p> | |
1389 | These classes is covered in the following sections. | |
1390 | ||
1391 | <h3><a name="Specialization">Specialization</a></h3> | |
1392 | ||
1393 | <p> | |
11a65df5 PM |
1394 | RCU is and always has been intended primarily for read-mostly situations, |
1395 | which means that RCU's read-side primitives are optimized, often at the | |
649e4368 | 1396 | expense of its update-side primitives. |
11a65df5 | 1397 | Experience thus far is captured by the following list of situations: |
649e4368 | 1398 | |
11a65df5 PM |
1399 | <ol> |
1400 | <li> Read-mostly data, where stale and inconsistent data is not | |
1401 | a problem: RCU works great! | |
1402 | <li> Read-mostly data, where data must be consistent: | |
1403 | RCU works well. | |
1404 | <li> Read-write data, where data must be consistent: | |
1405 | RCU <i>might</i> work OK. | |
1406 | Or not. | |
1407 | <li> Write-mostly data, where data must be consistent: | |
1408 | RCU is very unlikely to be the right tool for the job, | |
1409 | with the following exceptions, where RCU can provide: | |
1410 | <ol type=a> | |
1411 | <li> Existence guarantees for update-friendly mechanisms. | |
1412 | <li> Wait-free read-side primitives for real-time use. | |
1413 | </ol> | |
1414 | </ol> | |
649e4368 PM |
1415 | |
1416 | <p> | |
1417 | This focus on read-mostly situations means that RCU must interoperate | |
1418 | with other synchronization primitives. | |
1419 | For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt> | |
1420 | examples discussed earlier use RCU to protect readers and locking to | |
1421 | coordinate updaters. | |
1422 | However, the need extends much farther, requiring that a variety of | |
1423 | synchronization primitives be legal within RCU read-side critical sections, | |
1424 | including spinlocks, sequence locks, atomic operations, reference | |
1425 | counters, and memory barriers. | |
1426 | ||
6146f8df PM |
1427 | <table> |
1428 | <tr><th> </th></tr> | |
1429 | <tr><th align="left">Quick Quiz:</th></tr> | |
1430 | <tr><td> | |
1431 | What about sleeping locks? | |
1432 | </td></tr> | |
1433 | <tr><th align="left">Answer:</th></tr> | |
1434 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
1435 | These are forbidden within Linux-kernel RCU read-side critical | |
1436 | sections because it is not legal to place a quiescent state | |
1437 | (in this case, voluntary context switch) within an RCU read-side | |
1438 | critical section. | |
1439 | However, sleeping locks may be used within userspace RCU read-side | |
1440 | critical sections, and also within Linux-kernel sleepable RCU | |
1441 | <a href="#Sleepable RCU"><font color="ffffff">(SRCU)</font></a> | |
1442 | read-side critical sections. | |
1443 | In addition, the -rt patchset turns spinlocks into a | |
1444 | sleeping locks so that the corresponding critical sections | |
1445 | can be preempted, which also means that these sleeplockified | |
1446 | spinlocks (but not other sleeping locks!) may be acquire within | |
1447 | -rt-Linux-kernel RCU read-side critical sections. | |
1448 | </font> | |
1449 | ||
1450 | <p><font color="ffffff"> | |
1451 | Note that it <i>is</i> legal for a normal RCU read-side | |
1452 | critical section to conditionally acquire a sleeping locks | |
1453 | (as in <tt>mutex_trylock()</tt>), but only as long as it does | |
1454 | not loop indefinitely attempting to conditionally acquire that | |
1455 | sleeping locks. | |
1456 | The key point is that things like <tt>mutex_trylock()</tt> | |
1457 | either return with the mutex held, or return an error indication if | |
1458 | the mutex was not immediately available. | |
1459 | Either way, <tt>mutex_trylock()</tt> returns immediately without | |
1460 | sleeping. | |
1461 | </font></td></tr> | |
1462 | <tr><td> </td></tr> | |
1463 | </table> | |
649e4368 PM |
1464 | |
1465 | <p> | |
1466 | It often comes as a surprise that many algorithms do not require a | |
1467 | consistent view of data, but many can function in that mode, | |
1468 | with network routing being the poster child. | |
1469 | Internet routing algorithms take significant time to propagate | |
1470 | updates, so that by the time an update arrives at a given system, | |
1471 | that system has been sending network traffic the wrong way for | |
1472 | a considerable length of time. | |
1473 | Having a few threads continue to send traffic the wrong way for a | |
1474 | few more milliseconds is clearly not a problem: In the worst case, | |
1475 | TCP retransmissions will eventually get the data where it needs to go. | |
1476 | In general, when tracking the state of the universe outside of the | |
1477 | computer, some level of inconsistency must be tolerated due to | |
1478 | speed-of-light delays if nothing else. | |
1479 | ||
1480 | <p> | |
1481 | Furthermore, uncertainty about external state is inherent in many cases. | |
526914a0 | 1482 | For example, a pair of veterinarians might use heartbeat to determine |
649e4368 PM |
1483 | whether or not a given cat was alive. |
1484 | But how long should they wait after the last heartbeat to decide that | |
1485 | the cat is in fact dead? | |
1486 | Waiting less than 400 milliseconds makes no sense because this would | |
1487 | mean that a relaxed cat would be considered to cycle between death | |
1488 | and life more than 100 times per minute. | |
1489 | Moreover, just as with human beings, a cat's heart might stop for | |
1490 | some period of time, so the exact wait period is a judgment call. | |
526914a0 | 1491 | One of our pair of veterinarians might wait 30 seconds before pronouncing |
649e4368 | 1492 | the cat dead, while the other might insist on waiting a full minute. |
526914a0 | 1493 | The two veterinarians would then disagree on the state of the cat during |
11a65df5 | 1494 | the final 30 seconds of the minute following the last heartbeat. |
649e4368 PM |
1495 | |
1496 | <p> | |
1497 | Interestingly enough, this same situation applies to hardware. | |
1498 | When push comes to shove, how do we tell whether or not some | |
1499 | external server has failed? | |
1500 | We send messages to it periodically, and declare it failed if we | |
1501 | don't receive a response within a given period of time. | |
1502 | Policy decisions can usually tolerate short | |
1503 | periods of inconsistency. | |
1504 | The policy was decided some time ago, and is only now being put into | |
1505 | effect, so a few milliseconds of delay is normally inconsequential. | |
1506 | ||
1507 | <p> | |
1508 | However, there are algorithms that absolutely must see consistent data. | |
1509 | For example, the translation between a user-level SystemV semaphore | |
1510 | ID to the corresponding in-kernel data structure is protected by RCU, | |
1511 | but it is absolutely forbidden to update a semaphore that has just been | |
1512 | removed. | |
1513 | In the Linux kernel, this need for consistency is accommodated by acquiring | |
1514 | spinlocks located in the in-kernel data structure from within | |
1515 | the RCU read-side critical section, and this is indicated by the | |
1516 | green box in the figure above. | |
1517 | Many other techniques may be used, and are in fact used within the | |
1518 | Linux kernel. | |
1519 | ||
1520 | <p> | |
1521 | In short, RCU is not required to maintain consistency, and other | |
1522 | mechanisms may be used in concert with RCU when consistency is required. | |
1523 | RCU's specialization allows it to do its job extremely well, and its | |
1524 | ability to interoperate with other synchronization mechanisms allows | |
1525 | the right mix of synchronization tools to be used for a given job. | |
1526 | ||
1527 | <h3><a name="Performance and Scalability">Performance and Scalability</a></h3> | |
1528 | ||
1529 | <p> | |
1530 | Energy efficiency is a critical component of performance today, | |
1531 | and Linux-kernel RCU implementations must therefore avoid unnecessarily | |
1532 | awakening idle CPUs. | |
1533 | I cannot claim that this requirement was premeditated. | |
1534 | In fact, I learned of it during a telephone conversation in which I | |
1535 | was given “frank and open” feedback on the importance | |
1536 | of energy efficiency in battery-powered systems and on specific | |
1537 | energy-efficiency shortcomings of the Linux-kernel RCU implementation. | |
1538 | In my experience, the battery-powered embedded community will consider | |
1539 | any unnecessary wakeups to be extremely unfriendly acts. | |
1540 | So much so that mere Linux-kernel-mailing-list posts are | |
1541 | insufficient to vent their ire. | |
1542 | ||
1543 | <p> | |
1544 | Memory consumption is not particularly important for in most | |
1545 | situations, and has become decreasingly | |
1546 | so as memory sizes have expanded and memory | |
1547 | costs have plummeted. | |
1548 | However, as I learned from Matt Mackall's | |
1549 | <a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a> | |
1550 | efforts, memory footprint is critically important on single-CPU systems with | |
1551 | non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus | |
1552 | <a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a> | |
1553 | was born. | |
1554 | Josh Triplett has since taken over the small-memory banner with his | |
1555 | <a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a> | |
1556 | project, which resulted in | |
1557 | <a href="#Sleepable RCU">SRCU</a> | |
1558 | becoming optional for those kernels not needing it. | |
1559 | ||
1560 | <p> | |
1561 | The remaining performance requirements are, for the most part, | |
1562 | unsurprising. | |
1563 | For example, in keeping with RCU's read-side specialization, | |
1564 | <tt>rcu_dereference()</tt> should have negligible overhead (for | |
1565 | example, suppression of a few minor compiler optimizations). | |
1566 | Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and | |
1567 | <tt>rcu_read_unlock()</tt> should have exactly zero overhead. | |
1568 | ||
1569 | <p> | |
1570 | In preemptible environments, in the case where the RCU read-side | |
1571 | critical section was not preempted (as will be the case for the | |
1572 | highest-priority real-time process), <tt>rcu_read_lock()</tt> and | |
1573 | <tt>rcu_read_unlock()</tt> should have minimal overhead. | |
1574 | In particular, they should not contain atomic read-modify-write | |
1575 | operations, memory-barrier instructions, preemption disabling, | |
1576 | interrupt disabling, or backwards branches. | |
1577 | However, in the case where the RCU read-side critical section was preempted, | |
1578 | <tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts. | |
1579 | This is why it is better to nest an RCU read-side critical section | |
1580 | within a preempt-disable region than vice versa, at least in cases | |
1581 | where that critical section is short enough to avoid unduly degrading | |
1582 | real-time latencies. | |
1583 | ||
1584 | <p> | |
1585 | The <tt>synchronize_rcu()</tt> grace-period-wait primitive is | |
1586 | optimized for throughput. | |
1587 | It may therefore incur several milliseconds of latency in addition to | |
1588 | the duration of the longest RCU read-side critical section. | |
1589 | On the other hand, multiple concurrent invocations of | |
1590 | <tt>synchronize_rcu()</tt> are required to use batching optimizations | |
1591 | so that they can be satisfied by a single underlying grace-period-wait | |
1592 | operation. | |
1593 | For example, in the Linux kernel, it is not unusual for a single | |
1594 | grace-period-wait operation to serve more than | |
1595 | <a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a> | |
1596 | of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation | |
1597 | overhead down to nearly zero. | |
1598 | However, the grace-period optimization is also required to avoid | |
1599 | measurable degradation of real-time scheduling and interrupt latencies. | |
1600 | ||
1601 | <p> | |
1602 | In some cases, the multi-millisecond <tt>synchronize_rcu()</tt> | |
1603 | latencies are unacceptable. | |
1604 | In these cases, <tt>synchronize_rcu_expedited()</tt> may be used | |
1605 | instead, reducing the grace-period latency down to a few tens of | |
1606 | microseconds on small systems, at least in cases where the RCU read-side | |
1607 | critical sections are short. | |
1608 | There are currently no special latency requirements for | |
1609 | <tt>synchronize_rcu_expedited()</tt> on large systems, but, | |
1610 | consistent with the empirical nature of the RCU specification, | |
1611 | that is subject to change. | |
1612 | However, there most definitely are scalability requirements: | |
1613 | A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096 | |
1614 | CPUs should at least make reasonable forward progress. | |
1615 | In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt> | |
1616 | is permitted to impose modest degradation of real-time latency | |
1617 | on non-idle online CPUs. | |
6771853b PM |
1618 | Here, “modest” means roughly the same latency |
1619 | degradation as a scheduling-clock interrupt. | |
649e4368 PM |
1620 | |
1621 | <p> | |
1622 | There are a number of situations where even | |
1623 | <tt>synchronize_rcu_expedited()</tt>'s reduced grace-period | |
1624 | latency is unacceptable. | |
1625 | In these situations, the asynchronous <tt>call_rcu()</tt> can be | |
1626 | used in place of <tt>synchronize_rcu()</tt> as follows: | |
1627 | ||
1628 | <blockquote> | |
1629 | <pre> | |
1630 | 1 struct foo { | |
1631 | 2 int a; | |
1632 | 3 int b; | |
1633 | 4 struct rcu_head rh; | |
1634 | 5 }; | |
1635 | 6 | |
1636 | 7 static void remove_gp_cb(struct rcu_head *rhp) | |
1637 | 8 { | |
1638 | 9 struct foo *p = container_of(rhp, struct foo, rh); | |
1639 | 10 | |
1640 | 11 kfree(p); | |
1641 | 12 } | |
1642 | 13 | |
1643 | 14 bool remove_gp_asynchronous(void) | |
1644 | 15 { | |
1645 | 16 struct foo *p; | |
1646 | 17 | |
1647 | 18 spin_lock(&gp_lock); | |
1648 | 19 p = rcu_dereference(gp); | |
1649 | 20 if (!p) { | |
1650 | 21 spin_unlock(&gp_lock); | |
1651 | 22 return false; | |
1652 | 23 } | |
1653 | 24 rcu_assign_pointer(gp, NULL); | |
1654 | 25 call_rcu(&p->rh, remove_gp_cb); | |
1655 | 26 spin_unlock(&gp_lock); | |
1656 | 27 return true; | |
1657 | 28 } | |
1658 | </pre> | |
1659 | </blockquote> | |
1660 | ||
1661 | <p> | |
1662 | A definition of <tt>struct foo</tt> is finally needed, and appears | |
1663 | on lines 1-5. | |
1664 | The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt> | |
1665 | on line 25, and will be invoked after the end of a subsequent | |
1666 | grace period. | |
1667 | This gets the same effect as <tt>remove_gp_synchronous()</tt>, | |
1668 | but without forcing the updater to wait for a grace period to elapse. | |
1669 | The <tt>call_rcu()</tt> function may be used in a number of | |
1670 | situations where neither <tt>synchronize_rcu()</tt> nor | |
1671 | <tt>synchronize_rcu_expedited()</tt> would be legal, | |
1672 | including within preempt-disable code, <tt>local_bh_disable()</tt> code, | |
1673 | interrupt-disable code, and interrupt handlers. | |
514f1eb5 | 1674 | However, even <tt>call_rcu()</tt> is illegal within NMI handlers |
0c7d10e4 | 1675 | and from idle and offline CPUs. |
649e4368 PM |
1676 | The callback function (<tt>remove_gp_cb()</tt> in this case) will be |
1677 | executed within softirq (software interrupt) environment within the | |
1678 | Linux kernel, | |
1679 | either within a real softirq handler or under the protection | |
1680 | of <tt>local_bh_disable()</tt>. | |
1681 | In both the Linux kernel and in userspace, it is bad practice to | |
1682 | write an RCU callback function that takes too long. | |
1683 | Long-running operations should be relegated to separate threads or | |
1684 | (in the Linux kernel) workqueues. | |
1685 | ||
6146f8df PM |
1686 | <table> |
1687 | <tr><th> </th></tr> | |
1688 | <tr><th align="left">Quick Quiz:</th></tr> | |
1689 | <tr><td> | |
1690 | Why does line 19 use <tt>rcu_access_pointer()</tt>? | |
1691 | After all, <tt>call_rcu()</tt> on line 25 stores into the | |
1692 | structure, which would interact badly with concurrent insertions. | |
1693 | Doesn't this mean that <tt>rcu_dereference()</tt> is required? | |
1694 | </td></tr> | |
1695 | <tr><th align="left">Answer:</th></tr> | |
1696 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
1697 | Presumably the <tt>->gp_lock</tt> acquired on line 18 excludes | |
1698 | any changes, including any insertions that <tt>rcu_dereference()</tt> | |
1699 | would protect against. | |
1700 | Therefore, any insertions will be delayed until after | |
1701 | <tt>->gp_lock</tt> | |
1702 | is released on line 25, which in turn means that | |
1703 | <tt>rcu_access_pointer()</tt> suffices. | |
1704 | </font></td></tr> | |
1705 | <tr><td> </td></tr> | |
1706 | </table> | |
649e4368 PM |
1707 | |
1708 | <p> | |
1709 | However, all that <tt>remove_gp_cb()</tt> is doing is | |
1710 | invoking <tt>kfree()</tt> on the data element. | |
1711 | This is a common idiom, and is supported by <tt>kfree_rcu()</tt>, | |
1712 | which allows “fire and forget” operation as shown below: | |
1713 | ||
1714 | <blockquote> | |
1715 | <pre> | |
1716 | 1 struct foo { | |
1717 | 2 int a; | |
1718 | 3 int b; | |
1719 | 4 struct rcu_head rh; | |
1720 | 5 }; | |
1721 | 6 | |
1722 | 7 bool remove_gp_faf(void) | |
1723 | 8 { | |
1724 | 9 struct foo *p; | |
1725 | 10 | |
1726 | 11 spin_lock(&gp_lock); | |
1727 | 12 p = rcu_dereference(gp); | |
1728 | 13 if (!p) { | |
1729 | 14 spin_unlock(&gp_lock); | |
1730 | 15 return false; | |
1731 | 16 } | |
1732 | 17 rcu_assign_pointer(gp, NULL); | |
1733 | 18 kfree_rcu(p, rh); | |
1734 | 19 spin_unlock(&gp_lock); | |
1735 | 20 return true; | |
1736 | 21 } | |
1737 | </pre> | |
1738 | </blockquote> | |
1739 | ||
1740 | <p> | |
1741 | Note that <tt>remove_gp_faf()</tt> simply invokes | |
1742 | <tt>kfree_rcu()</tt> and proceeds, without any need to pay any | |
1743 | further attention to the subsequent grace period and <tt>kfree()</tt>. | |
1744 | It is permissible to invoke <tt>kfree_rcu()</tt> from the same | |
1745 | environments as for <tt>call_rcu()</tt>. | |
1746 | Interestingly enough, DYNIX/ptx had the equivalents of | |
1747 | <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not | |
1748 | <tt>synchronize_rcu()</tt>. | |
1749 | This was due to the fact that RCU was not heavily used within DYNIX/ptx, | |
1750 | so the very few places that needed something like | |
1751 | <tt>synchronize_rcu()</tt> simply open-coded it. | |
1752 | ||
6146f8df PM |
1753 | <table> |
1754 | <tr><th> </th></tr> | |
1755 | <tr><th align="left">Quick Quiz:</th></tr> | |
1756 | <tr><td> | |
1757 | Earlier it was claimed that <tt>call_rcu()</tt> and | |
1758 | <tt>kfree_rcu()</tt> allowed updaters to avoid being blocked | |
1759 | by readers. | |
1760 | But how can that be correct, given that the invocation of the callback | |
1761 | and the freeing of the memory (respectively) must still wait for | |
1762 | a grace period to elapse? | |
1763 | </td></tr> | |
1764 | <tr><th align="left">Answer:</th></tr> | |
1765 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
1766 | We could define things this way, but keep in mind that this sort of | |
1767 | definition would say that updates in garbage-collected languages | |
1768 | cannot complete until the next time the garbage collector runs, | |
1769 | which does not seem at all reasonable. | |
1770 | The key point is that in most cases, an updater using either | |
1771 | <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the | |
1772 | next update as soon as it has invoked <tt>call_rcu()</tt> or | |
1773 | <tt>kfree_rcu()</tt>, without having to wait for a subsequent | |
1774 | grace period. | |
1775 | </font></td></tr> | |
1776 | <tr><td> </td></tr> | |
1777 | </table> | |
649e4368 PM |
1778 | |
1779 | <p> | |
1780 | But what if the updater must wait for the completion of code to be | |
1781 | executed after the end of the grace period, but has other tasks | |
1782 | that can be carried out in the meantime? | |
1783 | The polling-style <tt>get_state_synchronize_rcu()</tt> and | |
1784 | <tt>cond_synchronize_rcu()</tt> functions may be used for this | |
1785 | purpose, as shown below: | |
1786 | ||
1787 | <blockquote> | |
1788 | <pre> | |
1789 | 1 bool remove_gp_poll(void) | |
1790 | 2 { | |
1791 | 3 struct foo *p; | |
1792 | 4 unsigned long s; | |
1793 | 5 | |
1794 | 6 spin_lock(&gp_lock); | |
1795 | 7 p = rcu_access_pointer(gp); | |
1796 | 8 if (!p) { | |
1797 | 9 spin_unlock(&gp_lock); | |
1798 | 10 return false; | |
1799 | 11 } | |
1800 | 12 rcu_assign_pointer(gp, NULL); | |
1801 | 13 spin_unlock(&gp_lock); | |
1802 | 14 s = get_state_synchronize_rcu(); | |
1803 | 15 do_something_while_waiting(); | |
1804 | 16 cond_synchronize_rcu(s); | |
1805 | 17 kfree(p); | |
1806 | 18 return true; | |
1807 | 19 } | |
1808 | </pre> | |
1809 | </blockquote> | |
1810 | ||
1811 | <p> | |
1812 | On line 14, <tt>get_state_synchronize_rcu()</tt> obtains a | |
1813 | “cookie” from RCU, | |
1814 | then line 15 carries out other tasks, | |
1815 | and finally, line 16 returns immediately if a grace period has | |
1816 | elapsed in the meantime, but otherwise waits as required. | |
1817 | The need for <tt>get_state_synchronize_rcu</tt> and | |
1818 | <tt>cond_synchronize_rcu()</tt> has appeared quite recently, | |
1819 | so it is too early to tell whether they will stand the test of time. | |
1820 | ||
1821 | <p> | |
1822 | RCU thus provides a range of tools to allow updaters to strike the | |
1823 | required tradeoff between latency, flexibility and CPU overhead. | |
1824 | ||
1825 | <h3><a name="Composability">Composability</a></h3> | |
1826 | ||
1827 | <p> | |
1828 | Composability has received much attention in recent years, perhaps in part | |
1829 | due to the collision of multicore hardware with object-oriented techniques | |
1830 | designed in single-threaded environments for single-threaded use. | |
1831 | And in theory, RCU read-side critical sections may be composed, and in | |
1832 | fact may be nested arbitrarily deeply. | |
1833 | In practice, as with all real-world implementations of composable | |
1834 | constructs, there are limitations. | |
1835 | ||
1836 | <p> | |
1837 | Implementations of RCU for which <tt>rcu_read_lock()</tt> | |
1838 | and <tt>rcu_read_unlock()</tt> generate no code, such as | |
1839 | Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be | |
1840 | nested arbitrarily deeply. | |
1841 | After all, there is no overhead. | |
1842 | Except that if all these instances of <tt>rcu_read_lock()</tt> | |
1843 | and <tt>rcu_read_unlock()</tt> are visible to the compiler, | |
1844 | compilation will eventually fail due to exhausting memory, | |
1845 | mass storage, or user patience, whichever comes first. | |
1846 | If the nesting is not visible to the compiler, as is the case with | |
1847 | mutually recursive functions each in its own translation unit, | |
1848 | stack overflow will result. | |
c75e9caa PM |
1849 | If the nesting takes the form of loops, perhaps in the guise of tail |
1850 | recursion, either the control variable | |
649e4368 PM |
1851 | will overflow or (in the Linux kernel) you will get an RCU CPU stall warning. |
1852 | Nevertheless, this class of RCU implementations is one | |
1853 | of the most composable constructs in existence. | |
1854 | ||
1855 | <p> | |
1856 | RCU implementations that explicitly track nesting depth | |
1857 | are limited by the nesting-depth counter. | |
1858 | For example, the Linux kernel's preemptible RCU limits nesting to | |
1859 | <tt>INT_MAX</tt>. | |
1860 | This should suffice for almost all practical purposes. | |
1861 | That said, a consecutive pair of RCU read-side critical sections | |
1862 | between which there is an operation that waits for a grace period | |
1863 | cannot be enclosed in another RCU read-side critical section. | |
1864 | This is because it is not legal to wait for a grace period within | |
1865 | an RCU read-side critical section: To do so would result either | |
1866 | in deadlock or | |
1867 | in RCU implicitly splitting the enclosing RCU read-side critical | |
1868 | section, neither of which is conducive to a long-lived and prosperous | |
1869 | kernel. | |
1870 | ||
0825458b PM |
1871 | <p> |
1872 | It is worth noting that RCU is not alone in limiting composability. | |
1873 | For example, many transactional-memory implementations prohibit | |
1874 | composing a pair of transactions separated by an irrevocable | |
1875 | operation (for example, a network receive operation). | |
1876 | For another example, lock-based critical sections can be composed | |
1877 | surprisingly freely, but only if deadlock is avoided. | |
1878 | ||
649e4368 PM |
1879 | <p> |
1880 | In short, although RCU read-side critical sections are highly composable, | |
1881 | care is required in some situations, just as is the case for any other | |
1882 | composable synchronization mechanism. | |
1883 | ||
1884 | <h3><a name="Corner Cases">Corner Cases</a></h3> | |
1885 | ||
1886 | <p> | |
1887 | A given RCU workload might have an endless and intense stream of | |
1888 | RCU read-side critical sections, perhaps even so intense that there | |
1889 | was never a point in time during which there was not at least one | |
1890 | RCU read-side critical section in flight. | |
1891 | RCU cannot allow this situation to block grace periods: As long as | |
1892 | all the RCU read-side critical sections are finite, grace periods | |
1893 | must also be finite. | |
1894 | ||
1895 | <p> | |
1896 | That said, preemptible RCU implementations could potentially result | |
1897 | in RCU read-side critical sections being preempted for long durations, | |
1898 | which has the effect of creating a long-duration RCU read-side | |
1899 | critical section. | |
1900 | This situation can arise only in heavily loaded systems, but systems using | |
1901 | real-time priorities are of course more vulnerable. | |
1902 | Therefore, RCU priority boosting is provided to help deal with this | |
1903 | case. | |
1904 | That said, the exact requirements on RCU priority boosting will likely | |
1905 | evolve as more experience accumulates. | |
1906 | ||
1907 | <p> | |
1908 | Other workloads might have very high update rates. | |
1909 | Although one can argue that such workloads should instead use | |
1910 | something other than RCU, the fact remains that RCU must | |
1911 | handle such workloads gracefully. | |
1912 | This requirement is another factor driving batching of grace periods, | |
1913 | but it is also the driving force behind the checks for large numbers | |
1914 | of queued RCU callbacks in the <tt>call_rcu()</tt> code path. | |
1915 | Finally, high update rates should not delay RCU read-side critical | |
6771853b | 1916 | sections, although some small read-side delays can occur when using |
649e4368 | 1917 | <tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use |
6771853b | 1918 | of <tt>smp_call_function_single()</tt>. |
649e4368 PM |
1919 | |
1920 | <p> | |
1921 | Although all three of these corner cases were understood in the early | |
1922 | 1990s, a simple user-level test consisting of <tt>close(open(path))</tt> | |
1923 | in a tight loop | |
1924 | in the early 2000s suddenly provided a much deeper appreciation of the | |
1925 | high-update-rate corner case. | |
1926 | This test also motivated addition of some RCU code to react to high update | |
1927 | rates, for example, if a given CPU finds itself with more than 10,000 | |
1928 | RCU callbacks queued, it will cause RCU to take evasive action by | |
1929 | more aggressively starting grace periods and more aggressively forcing | |
1930 | completion of grace-period processing. | |
1931 | This evasive action causes the grace period to complete more quickly, | |
1932 | but at the cost of restricting RCU's batching optimizations, thus | |
1933 | increasing the CPU overhead incurred by that grace period. | |
1934 | ||
1935 | <h2><a name="Software-Engineering Requirements"> | |
1936 | Software-Engineering Requirements</a></h2> | |
1937 | ||
1938 | <p> | |
1939 | Between Murphy's Law and “To err is human”, it is necessary to | |
1940 | guard against mishaps and misuse: | |
1941 | ||
1942 | <ol> | |
1943 | <li> It is all too easy to forget to use <tt>rcu_read_lock()</tt> | |
1944 | everywhere that it is needed, so kernels built with | |
526914a0 | 1945 | <tt>CONFIG_PROVE_RCU=y</tt> will splat if |
649e4368 PM |
1946 | <tt>rcu_dereference()</tt> is used outside of an |
1947 | RCU read-side critical section. | |
1948 | Update-side code can use <tt>rcu_dereference_protected()</tt>, | |
1949 | which takes a | |
1950 | <a href="https://lwn.net/Articles/371986/">lockdep expression</a> | |
1951 | to indicate what is providing the protection. | |
1952 | If the indicated protection is not provided, a lockdep splat | |
1953 | is emitted. | |
1954 | ||
1955 | <p> | |
1956 | Code shared between readers and updaters can use | |
1957 | <tt>rcu_dereference_check()</tt>, which also takes a | |
1958 | lockdep expression, and emits a lockdep splat if neither | |
1959 | <tt>rcu_read_lock()</tt> nor the indicated protection | |
1960 | is in place. | |
1961 | In addition, <tt>rcu_dereference_raw()</tt> is used in those | |
1962 | (hopefully rare) cases where the required protection cannot | |
1963 | be easily described. | |
1964 | Finally, <tt>rcu_read_lock_held()</tt> is provided to | |
1965 | allow a function to verify that it has been invoked within | |
1966 | an RCU read-side critical section. | |
1967 | I was made aware of this set of requirements shortly after Thomas | |
1968 | Gleixner audited a number of RCU uses. | |
1969 | <li> A given function might wish to check for RCU-related preconditions | |
1970 | upon entry, before using any other RCU API. | |
1971 | The <tt>rcu_lockdep_assert()</tt> does this job, | |
1972 | asserting the expression in kernels having lockdep enabled | |
1973 | and doing nothing otherwise. | |
1974 | <li> It is also easy to forget to use <tt>rcu_assign_pointer()</tt> | |
1975 | and <tt>rcu_dereference()</tt>, perhaps (incorrectly) | |
1976 | substituting a simple assignment. | |
1977 | To catch this sort of error, a given RCU-protected pointer may be | |
41a2901e PM |
1978 | tagged with <tt>__rcu</tt>, after which sparse |
1979 | will complain about simple-assignment accesses to that pointer. | |
649e4368 PM |
1980 | Arnd Bergmann made me aware of this requirement, and also |
1981 | supplied the needed | |
1982 | <a href="https://lwn.net/Articles/376011/">patch series</a>. | |
1983 | <li> Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt> | |
1984 | will splat if a data element is passed to <tt>call_rcu()</tt> | |
1985 | twice in a row, without a grace period in between. | |
1986 | (This error is similar to a double free.) | |
1987 | The corresponding <tt>rcu_head</tt> structures that are | |
1988 | dynamically allocated are automatically tracked, but | |
1989 | <tt>rcu_head</tt> structures allocated on the stack | |
1990 | must be initialized with <tt>init_rcu_head_on_stack()</tt> | |
1991 | and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>. | |
1992 | Similarly, statically allocated non-stack <tt>rcu_head</tt> | |
1993 | structures must be initialized with <tt>init_rcu_head()</tt> | |
1994 | and cleaned up with <tt>destroy_rcu_head()</tt>. | |
1995 | Mathieu Desnoyers made me aware of this requirement, and also | |
1996 | supplied the needed | |
1997 | <a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>. | |
1998 | <li> An infinite loop in an RCU read-side critical section will | |
01d3ad38 PM |
1999 | eventually trigger an RCU CPU stall warning splat, with |
2000 | the duration of “eventually” being controlled by the | |
2001 | <tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or, | |
2002 | alternatively, by the | |
2003 | <tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs | |
2004 | parameter. | |
649e4368 PM |
2005 | However, RCU is not obligated to produce this splat |
2006 | unless there is a grace period waiting on that particular | |
2007 | RCU read-side critical section. | |
01d3ad38 PM |
2008 | <p> |
2009 | Some extreme workloads might intentionally delay | |
2010 | RCU grace periods, and systems running those workloads can | |
2011 | be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt> | |
2012 | to suppress the splats. | |
2013 | This kernel parameter may also be set via <tt>sysfs</tt>. | |
2014 | Furthermore, RCU CPU stall warnings are counter-productive | |
2015 | during sysrq dumps and during panics. | |
2016 | RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and | |
2017 | <tt>rcu_sysrq_end()</tt> API members to be called before | |
2018 | and after long sysrq dumps. | |
2019 | RCU also supplies the <tt>rcu_panic()</tt> notifier that is | |
2020 | automatically invoked at the beginning of a panic to suppress | |
2021 | further RCU CPU stall warnings. | |
2022 | ||
2023 | <p> | |
649e4368 PM |
2024 | This requirement made itself known in the early 1990s, pretty |
2025 | much the first time that it was necessary to debug a CPU stall. | |
01d3ad38 PM |
2026 | That said, the initial implementation in DYNIX/ptx was quite |
2027 | generic in comparison with that of Linux. | |
649e4368 PM |
2028 | <li> Although it would be very good to detect pointers leaking out |
2029 | of RCU read-side critical sections, there is currently no | |
2030 | good way of doing this. | |
2031 | One complication is the need to distinguish between pointers | |
2032 | leaking and pointers that have been handed off from RCU to | |
2033 | some other synchronization mechanism, for example, reference | |
2034 | counting. | |
2035 | <li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related | |
ae91aa0a | 2036 | information is provided via event tracing. |
649e4368 PM |
2037 | <li> Open-coded use of <tt>rcu_assign_pointer()</tt> and |
2038 | <tt>rcu_dereference()</tt> to create typical linked | |
2039 | data structures can be surprisingly error-prone. | |
2040 | Therefore, RCU-protected | |
2041 | <a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a> | |
2042 | and, more recently, RCU-protected | |
2043 | <a href="https://lwn.net/Articles/612100/">hash tables</a> | |
2044 | are available. | |
2045 | Many other special-purpose RCU-protected data structures are | |
2046 | available in the Linux kernel and the userspace RCU library. | |
2047 | <li> Some linked structures are created at compile time, but still | |
2048 | require <tt>__rcu</tt> checking. | |
2049 | The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this | |
2050 | purpose. | |
2051 | <li> It is not necessary to use <tt>rcu_assign_pointer()</tt> | |
2052 | when creating linked structures that are to be published via | |
2053 | a single external pointer. | |
2054 | The <tt>RCU_INIT_POINTER()</tt> macro is provided for | |
2055 | this task and also for assigning <tt>NULL</tt> pointers | |
2056 | at runtime. | |
2057 | </ol> | |
2058 | ||
2059 | <p> | |
2060 | This not a hard-and-fast list: RCU's diagnostic capabilities will | |
2061 | continue to be guided by the number and type of usage bugs found | |
2062 | in real-world RCU usage. | |
2063 | ||
2064 | <h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2> | |
2065 | ||
2066 | <p> | |
2067 | The Linux kernel provides an interesting environment for all kinds of | |
2068 | software, including RCU. | |
2069 | Some of the relevant points of interest are as follows: | |
2070 | ||
2071 | <ol> | |
2072 | <li> <a href="#Configuration">Configuration</a>. | |
2073 | <li> <a href="#Firmware Interface">Firmware Interface</a>. | |
2074 | <li> <a href="#Early Boot">Early Boot</a>. | |
2075 | <li> <a href="#Interrupts and NMIs"> | |
2076 | Interrupts and non-maskable interrupts (NMIs)</a>. | |
2077 | <li> <a href="#Loadable Modules">Loadable Modules</a>. | |
2078 | <li> <a href="#Hotplug CPU">Hotplug CPU</a>. | |
2079 | <li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. | |
2080 | <li> <a href="#Tracing and RCU">Tracing and RCU</a>. | |
2081 | <li> <a href="#Energy Efficiency">Energy Efficiency</a>. | |
850bf6d5 PM |
2082 | <li> <a href="#Scheduling-Clock Interrupts and RCU"> |
2083 | Scheduling-Clock Interrupts and RCU</a>. | |
701e8031 | 2084 | <li> <a href="#Memory Efficiency">Memory Efficiency</a>. |
649e4368 PM |
2085 | <li> <a href="#Performance, Scalability, Response Time, and Reliability"> |
2086 | Performance, Scalability, Response Time, and Reliability</a>. | |
2087 | </ol> | |
2088 | ||
2089 | <p> | |
2090 | This list is probably incomplete, but it does give a feel for the | |
2091 | most notable Linux-kernel complications. | |
2092 | Each of the following sections covers one of the above topics. | |
2093 | ||
2094 | <h3><a name="Configuration">Configuration</a></h3> | |
2095 | ||
2096 | <p> | |
2097 | RCU's goal is automatic configuration, so that almost nobody | |
2098 | needs to worry about RCU's <tt>Kconfig</tt> options. | |
2099 | And for almost all users, RCU does in fact work well | |
2100 | “out of the box.” | |
2101 | ||
2102 | <p> | |
2103 | However, there are specialized use cases that are handled by | |
2104 | kernel boot parameters and <tt>Kconfig</tt> options. | |
2105 | Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users | |
2106 | about new <tt>Kconfig</tt> options, which requires almost all of them | |
2107 | be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option. | |
2108 | ||
2109 | <p> | |
2110 | This all should be quite obvious, but the fact remains that | |
2111 | Linus Torvalds recently had to | |
2112 | <a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a> | |
2113 | me of this requirement. | |
2114 | ||
2115 | <h3><a name="Firmware Interface">Firmware Interface</a></h3> | |
2116 | ||
2117 | <p> | |
2118 | In many cases, kernel obtains information about the system from the | |
2119 | firmware, and sometimes things are lost in translation. | |
2120 | Or the translation is accurate, but the original message is bogus. | |
2121 | ||
2122 | <p> | |
2123 | For example, some systems' firmware overreports the number of CPUs, | |
2124 | sometimes by a large factor. | |
2125 | If RCU naively believed the firmware, as it used to do, | |
2126 | it would create too many per-CPU kthreads. | |
2127 | Although the resulting system will still run correctly, the extra | |
2128 | kthreads needlessly consume memory and can cause confusion | |
2129 | when they show up in <tt>ps</tt> listings. | |
2130 | ||
2131 | <p> | |
2132 | RCU must therefore wait for a given CPU to actually come online before | |
2133 | it can allow itself to believe that the CPU actually exists. | |
2134 | The resulting “ghost CPUs” (which are never going to | |
2135 | come online) cause a number of | |
2136 | <a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>. | |
2137 | ||
2138 | <h3><a name="Early Boot">Early Boot</a></h3> | |
2139 | ||
2140 | <p> | |
2141 | The Linux kernel's boot sequence is an interesting process, | |
2142 | and RCU is used early, even before <tt>rcu_init()</tt> | |
2143 | is invoked. | |
2144 | In fact, a number of RCU's primitives can be used as soon as the | |
2145 | initial task's <tt>task_struct</tt> is available and the | |
2146 | boot CPU's per-CPU variables are set up. | |
2147 | The read-side primitives (<tt>rcu_read_lock()</tt>, | |
2148 | <tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>, | |
2149 | and <tt>rcu_access_pointer()</tt>) will operate normally very early on, | |
2150 | as will <tt>rcu_assign_pointer()</tt>. | |
2151 | ||
2152 | <p> | |
2153 | Although <tt>call_rcu()</tt> may be invoked at any | |
2154 | time during boot, callbacks are not guaranteed to be invoked until after | |
f1387d77 PM |
2155 | all of RCU's kthreads have been spawned, which occurs at |
2156 | <tt>early_initcall()</tt> time. | |
649e4368 PM |
2157 | This delay in callback invocation is due to the fact that RCU does not |
2158 | invoke callbacks until it is fully initialized, and this full initialization | |
2159 | cannot occur until after the scheduler has initialized itself to the | |
2160 | point where RCU can spawn and run its kthreads. | |
2161 | In theory, it would be possible to invoke callbacks earlier, | |
2162 | however, this is not a panacea because there would be severe restrictions | |
2163 | on what operations those callbacks could invoke. | |
2164 | ||
2165 | <p> | |
77095901 | 2166 | Perhaps surprisingly, <tt>synchronize_rcu()</tt> and |
f1387d77 | 2167 | <tt>synchronize_rcu_expedited()</tt>, |
77095901 | 2168 | will operate normally |
649e4368 PM |
2169 | during very early boot, the reason being that there is only one CPU |
2170 | and preemption is disabled. | |
2171 | This means that the call <tt>synchronize_rcu()</tt> (or friends) | |
2172 | itself is a quiescent | |
2173 | state and thus a grace period, so the early-boot implementation can | |
2174 | be a no-op. | |
2175 | ||
2176 | <p> | |
f1387d77 PM |
2177 | However, once the scheduler has spawned its first kthread, this early |
2178 | boot trick fails for <tt>synchronize_rcu()</tt> (as well as for | |
2179 | <tt>synchronize_rcu_expedited()</tt>) in <tt>CONFIG_PREEMPT=y</tt> | |
2180 | kernels. | |
2181 | The reason is that an RCU read-side critical section might be preempted, | |
2182 | which means that a subsequent <tt>synchronize_rcu()</tt> really does have | |
2183 | to wait for something, as opposed to simply returning immediately. | |
2184 | Unfortunately, <tt>synchronize_rcu()</tt> can't do this until all of | |
2185 | its kthreads are spawned, which doesn't happen until some time during | |
2186 | <tt>early_initcalls()</tt> time. | |
2187 | But this is no excuse: RCU is nevertheless required to correctly handle | |
6771853b | 2188 | synchronous grace periods during this time period. |
f1387d77 PM |
2189 | Once all of its kthreads are up and running, RCU starts running |
2190 | normally. | |
649e4368 | 2191 | |
6146f8df PM |
2192 | <table> |
2193 | <tr><th> </th></tr> | |
2194 | <tr><th align="left">Quick Quiz:</th></tr> | |
2195 | <tr><td> | |
f1387d77 PM |
2196 | How can RCU possibly handle grace periods before all of its |
2197 | kthreads have been spawned??? | |
6146f8df PM |
2198 | </td></tr> |
2199 | <tr><th align="left">Answer:</th></tr> | |
2200 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
f1387d77 | 2201 | Very carefully! |
6771853b | 2202 | </font> |
f1387d77 | 2203 | |
6771853b PM |
2204 | <p><font color="ffffff"> |
2205 | During the “dead zone” between the time that the | |
f1387d77 PM |
2206 | scheduler spawns the first task and the time that all of RCU's |
2207 | kthreads have been spawned, all synchronous grace periods are | |
2208 | handled by the expedited grace-period mechanism. | |
2209 | At runtime, this expedited mechanism relies on workqueues, but | |
2210 | during the dead zone the requesting task itself drives the | |
2211 | desired expedited grace period. | |
2212 | Because dead-zone execution takes place within task context, | |
2213 | everything works. | |
2214 | Once the dead zone ends, expedited grace periods go back to | |
2215 | using workqueues, as is required to avoid problems that would | |
2216 | otherwise occur when a user task received a POSIX signal while | |
2217 | driving an expedited grace period. | |
6771853b | 2218 | </font> |
f1387d77 | 2219 | |
6771853b PM |
2220 | <p><font color="ffffff"> |
2221 | And yes, this does mean that it is unhelpful to send POSIX | |
f1387d77 PM |
2222 | signals to random tasks between the time that the scheduler |
2223 | spawns its first kthread and the time that RCU's kthreads | |
2224 | have all been spawned. | |
2225 | If there ever turns out to be a good reason for sending POSIX | |
2226 | signals during that time, appropriate adjustments will be made. | |
2227 | (If it turns out that POSIX signals are sent during this time for | |
2228 | no good reason, other adjustments will be made, appropriate | |
2229 | or otherwise.) | |
6146f8df PM |
2230 | </font></td></tr> |
2231 | <tr><td> </td></tr> | |
2232 | </table> | |
649e4368 PM |
2233 | |
2234 | <p> | |
2235 | I learned of these boot-time requirements as a result of a series of | |
2236 | system hangs. | |
2237 | ||
2238 | <h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3> | |
2239 | ||
2240 | <p> | |
2241 | The Linux kernel has interrupts, and RCU read-side critical sections are | |
2242 | legal within interrupt handlers and within interrupt-disabled regions | |
2243 | of code, as are invocations of <tt>call_rcu()</tt>. | |
2244 | ||
2245 | <p> | |
2246 | Some Linux-kernel architectures can enter an interrupt handler from | |
2247 | non-idle process context, and then just never leave it, instead stealthily | |
2248 | transitioning back to process context. | |
2249 | This trick is sometimes used to invoke system calls from inside the kernel. | |
2250 | These “half-interrupts” mean that RCU has to be very careful | |
2251 | about how it counts interrupt nesting levels. | |
2252 | I learned of this requirement the hard way during a rewrite | |
2253 | of RCU's dyntick-idle code. | |
2254 | ||
2255 | <p> | |
2256 | The Linux kernel has non-maskable interrupts (NMIs), and | |
2257 | RCU read-side critical sections are legal within NMI handlers. | |
2258 | Thankfully, RCU update-side primitives, including | |
2259 | <tt>call_rcu()</tt>, are prohibited within NMI handlers. | |
2260 | ||
2261 | <p> | |
2262 | The name notwithstanding, some Linux-kernel architectures | |
2263 | can have nested NMIs, which RCU must handle correctly. | |
2264 | Andy Lutomirski | |
a5a28895 | 2265 | <a href="https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a> |
649e4368 PM |
2266 | with this requirement; |
2267 | he also kindly surprised me with | |
a5a28895 | 2268 | <a href="https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a> |
649e4368 PM |
2269 | that meets this requirement. |
2270 | ||
e77cb325 PM |
2271 | <p> |
2272 | Furthermore, NMI handlers can be interrupted by what appear to RCU | |
2273 | to be normal interrupts. | |
2274 | One way that this can happen is for code that directly invokes | |
2275 | <tt>rcu_irq_enter()</tt> and </tt>rcu_irq_exit()</tt> to be called | |
2276 | from an NMI handler. | |
2277 | This astonishing fact of life prompted the current code structure, | |
2278 | which has <tt>rcu_irq_enter()</tt> invoking <tt>rcu_nmi_enter()</tt> | |
2279 | and <tt>rcu_irq_exit()</tt> invoking <tt>rcu_nmi_exit()</tt>. | |
2280 | And yes, I also learned of this requirement the hard way. | |
2281 | ||
649e4368 PM |
2282 | <h3><a name="Loadable Modules">Loadable Modules</a></h3> |
2283 | ||
2284 | <p> | |
2285 | The Linux kernel has loadable modules, and these modules can | |
2286 | also be unloaded. | |
2287 | After a given module has been unloaded, any attempt to call | |
2288 | one of its functions results in a segmentation fault. | |
2289 | The module-unload functions must therefore cancel any | |
2290 | delayed calls to loadable-module functions, for example, | |
2291 | any outstanding <tt>mod_timer()</tt> must be dealt with | |
2292 | via <tt>del_timer_sync()</tt> or similar. | |
2293 | ||
2294 | <p> | |
2295 | Unfortunately, there is no way to cancel an RCU callback; | |
2296 | once you invoke <tt>call_rcu()</tt>, the callback function is | |
2297 | going to eventually be invoked, unless the system goes down first. | |
2298 | Because it is normally considered socially irresponsible to crash the system | |
2299 | in response to a module unload request, we need some other way | |
2300 | to deal with in-flight RCU callbacks. | |
2301 | ||
2302 | <p> | |
2303 | RCU therefore provides | |
2304 | <tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>, | |
2305 | which waits until all in-flight RCU callbacks have been invoked. | |
2306 | If a module uses <tt>call_rcu()</tt>, its exit function should therefore | |
2307 | prevent any future invocation of <tt>call_rcu()</tt>, then invoke | |
2308 | <tt>rcu_barrier()</tt>. | |
2309 | In theory, the underlying module-unload code could invoke | |
2310 | <tt>rcu_barrier()</tt> unconditionally, but in practice this would | |
2311 | incur unacceptable latencies. | |
2312 | ||
2313 | <p> | |
2314 | Nikita Danilov noted this requirement for an analogous filesystem-unmount | |
2315 | situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU. | |
2316 | The need for <tt>rcu_barrier()</tt> for module unloading became | |
2317 | apparent later. | |
2318 | ||
6771853b PM |
2319 | <p> |
2320 | <b>Important note</b>: The <tt>rcu_barrier()</tt> function is not, | |
2321 | repeat, <i>not</i>, obligated to wait for a grace period. | |
2322 | It is instead only required to wait for RCU callbacks that have | |
2323 | already been posted. | |
2324 | Therefore, if there are no RCU callbacks posted anywhere in the system, | |
2325 | <tt>rcu_barrier()</tt> is within its rights to return immediately. | |
2326 | Even if there are callbacks posted, <tt>rcu_barrier()</tt> does not | |
2327 | necessarily need to wait for a grace period. | |
2328 | ||
2329 | <table> | |
2330 | <tr><th> </th></tr> | |
2331 | <tr><th align="left">Quick Quiz:</th></tr> | |
2332 | <tr><td> | |
2333 | Wait a minute! | |
2334 | Each RCU callbacks must wait for a grace period to complete, | |
2335 | and <tt>rcu_barrier()</tt> must wait for each pre-existing | |
2336 | callback to be invoked. | |
2337 | Doesn't <tt>rcu_barrier()</tt> therefore need to wait for | |
2338 | a full grace period if there is even one callback posted anywhere | |
2339 | in the system? | |
2340 | </td></tr> | |
2341 | <tr><th align="left">Answer:</th></tr> | |
2342 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
2343 | Absolutely not!!! | |
2344 | </font> | |
2345 | ||
2346 | <p><font color="ffffff"> | |
2347 | Yes, each RCU callbacks must wait for a grace period to complete, | |
2348 | but it might well be partly (or even completely) finished waiting | |
2349 | by the time <tt>rcu_barrier()</tt> is invoked. | |
2350 | In that case, <tt>rcu_barrier()</tt> need only wait for the | |
2351 | remaining portion of the grace period to elapse. | |
2352 | So even if there are quite a few callbacks posted, | |
2353 | <tt>rcu_barrier()</tt> might well return quite quickly. | |
2354 | </font> | |
2355 | ||
2356 | <p><font color="ffffff"> | |
2357 | So if you need to wait for a grace period as well as for all | |
2358 | pre-existing callbacks, you will need to invoke both | |
2359 | <tt>synchronize_rcu()</tt> and <tt>rcu_barrier()</tt>. | |
2360 | If latency is a concern, you can always use workqueues | |
2361 | to invoke them concurrently. | |
2362 | </font></td></tr> | |
2363 | <tr><td> </td></tr> | |
2364 | </table> | |
2365 | ||
649e4368 PM |
2366 | <h3><a name="Hotplug CPU">Hotplug CPU</a></h3> |
2367 | ||
2368 | <p> | |
2369 | The Linux kernel supports CPU hotplug, which means that CPUs | |
2370 | can come and go. | |
6771853b PM |
2371 | It is of course illegal to use any RCU API member from an offline CPU, |
2372 | with the exception of <a href="#Sleepable RCU">SRCU</a> read-side | |
2373 | critical sections. | |
649e4368 PM |
2374 | This requirement was present from day one in DYNIX/ptx, but |
2375 | on the other hand, the Linux kernel's CPU-hotplug implementation | |
2376 | is “interesting.” | |
2377 | ||
2378 | <p> | |
2379 | The Linux-kernel CPU-hotplug implementation has notifiers that | |
2380 | are used to allow the various kernel subsystems (including RCU) | |
2381 | to respond appropriately to a given CPU-hotplug operation. | |
2382 | Most RCU operations may be invoked from CPU-hotplug notifiers, | |
6771853b PM |
2383 | including even synchronous grace-period operations such as |
2384 | <tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>. | |
649e4368 PM |
2385 | |
2386 | <p> | |
6771853b | 2387 | However, all-callback-wait operations such as |
649e4368 PM |
2388 | <tt>rcu_barrier()</tt> are also not supported, due to the |
2389 | fact that there are phases of CPU-hotplug operations where | |
2390 | the outgoing CPU's callbacks will not be invoked until after | |
2391 | the CPU-hotplug operation ends, which could also result in deadlock. | |
6771853b PM |
2392 | Furthermore, <tt>rcu_barrier()</tt> blocks CPU-hotplug operations |
2393 | during its execution, which results in another type of deadlock | |
2394 | when invoked from a CPU-hotplug notifier. | |
649e4368 PM |
2395 | |
2396 | <h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3> | |
2397 | ||
2398 | <p> | |
2399 | RCU depends on the scheduler, and the scheduler uses RCU to | |
2400 | protect some of its data structures. | |
2401 | This means the scheduler is forbidden from acquiring | |
2402 | the runqueue locks and the priority-inheritance locks | |
a4b57562 PM |
2403 | in the middle of an outermost RCU read-side critical section unless either |
2404 | (1) it releases them before exiting that same | |
2405 | RCU read-side critical section, or | |
c64c4b0f | 2406 | (2) interrupts are disabled across |
a4b57562 PM |
2407 | that entire RCU read-side critical section. |
2408 | This same prohibition also applies (recursively!) to any lock that is acquired | |
649e4368 | 2409 | while holding any lock to which this prohibition applies. |
a4b57562 PM |
2410 | Adhering to this rule prevents preemptible RCU from invoking |
2411 | <tt>rcu_read_unlock_special()</tt> while either runqueue or | |
2412 | priority-inheritance locks are held, thus avoiding deadlock. | |
649e4368 | 2413 | |
c64c4b0f PM |
2414 | <p> |
2415 | Prior to v4.4, it was only necessary to disable preemption across | |
2416 | RCU read-side critical sections that acquired scheduler locks. | |
2417 | In v4.4, expedited grace periods started using IPIs, and these | |
2418 | IPIs could force a <tt>rcu_read_unlock()</tt> to take the slowpath. | |
2419 | Therefore, this expedited-grace-period change required disabling of | |
2420 | interrupts, not just preemption. | |
2421 | ||
649e4368 PM |
2422 | <p> |
2423 | For RCU's part, the preemptible-RCU <tt>rcu_read_unlock()</tt> | |
2424 | implementation must be written carefully to avoid similar deadlocks. | |
2425 | In particular, <tt>rcu_read_unlock()</tt> must tolerate an | |
2426 | interrupt where the interrupt handler invokes both | |
2427 | <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. | |
2428 | This possibility requires <tt>rcu_read_unlock()</tt> to use | |
2429 | negative nesting levels to avoid destructive recursion via | |
2430 | interrupt handler's use of RCU. | |
2431 | ||
2432 | <p> | |
2433 | This pair of mutual scheduler-RCU requirements came as a | |
2434 | <a href="https://lwn.net/Articles/453002/">complete surprise</a>. | |
2435 | ||
2436 | <p> | |
2437 | As noted above, RCU makes use of kthreads, and it is necessary to | |
2438 | avoid excessive CPU-time accumulation by these kthreads. | |
2439 | This requirement was no surprise, but RCU's violation of it | |
2440 | when running context-switch-heavy workloads when built with | |
2441 | <tt>CONFIG_NO_HZ_FULL=y</tt> | |
2442 | <a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>. | |
2443 | RCU has made good progress towards meeting this requirement, even | |
2444 | for context-switch-have <tt>CONFIG_NO_HZ_FULL=y</tt> workloads, | |
2445 | but there is room for further improvement. | |
2446 | ||
2447 | <h3><a name="Tracing and RCU">Tracing and RCU</a></h3> | |
2448 | ||
2449 | <p> | |
2450 | It is possible to use tracing on RCU code, but tracing itself | |
2451 | uses RCU. | |
2452 | For this reason, <tt>rcu_dereference_raw_notrace()</tt> | |
2453 | is provided for use by tracing, which avoids the destructive | |
2454 | recursion that could otherwise ensue. | |
2455 | This API is also used by virtualization in some architectures, | |
2456 | where RCU readers execute in environments in which tracing | |
2457 | cannot be used. | |
2458 | The tracing folks both located the requirement and provided the | |
2459 | needed fix, so this surprise requirement was relatively painless. | |
2460 | ||
2461 | <h3><a name="Energy Efficiency">Energy Efficiency</a></h3> | |
2462 | ||
2463 | <p> | |
2464 | Interrupting idle CPUs is considered socially unacceptable, | |
2465 | especially by people with battery-powered embedded systems. | |
2466 | RCU therefore conserves energy by detecting which CPUs are | |
2467 | idle, including tracking CPUs that have been interrupted from idle. | |
2468 | This is a large part of the energy-efficiency requirement, | |
2469 | so I learned of this via an irate phone call. | |
2470 | ||
2471 | <p> | |
2472 | Because RCU avoids interrupting idle CPUs, it is illegal to | |
2473 | execute an RCU read-side critical section on an idle CPU. | |
2474 | (Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat | |
2475 | if you try it.) | |
2476 | The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt> | |
2477 | event tracing is provided to work around this restriction. | |
2478 | In addition, <tt>rcu_is_watching()</tt> may be used to | |
2479 | test whether or not it is currently legal to run RCU read-side | |
2480 | critical sections on this CPU. | |
2481 | I learned of the need for diagnostics on the one hand | |
2482 | and <tt>RCU_NONIDLE()</tt> on the other while inspecting | |
2483 | idle-loop code. | |
2484 | Steven Rostedt supplied <tt>_rcuidle</tt> event tracing, | |
2485 | which is used quite heavily in the idle loop. | |
c79dac75 PM |
2486 | However, there are some restrictions on the code placed within |
2487 | <tt>RCU_NONIDLE()</tt>: | |
2488 | ||
2489 | <ol> | |
2490 | <li> Blocking is prohibited. | |
2491 | In practice, this is not a serious restriction given that idle | |
2492 | tasks are prohibited from blocking to begin with. | |
526914a0 | 2493 | <li> Although nesting <tt>RCU_NONIDLE()</tt> is permitted, they cannot |
c79dac75 PM |
2494 | nest indefinitely deeply. |
2495 | However, given that they can be nested on the order of a million | |
2496 | deep, even on 32-bit systems, this should not be a serious | |
2497 | restriction. | |
2498 | This nesting limit would probably be reached long after the | |
2499 | compiler OOMed or the stack overflowed. | |
2500 | <li> Any code path that enters <tt>RCU_NONIDLE()</tt> must sequence | |
2501 | out of that same <tt>RCU_NONIDLE()</tt>. | |
2502 | For example, the following is grossly illegal: | |
2503 | ||
2504 | <blockquote> | |
2505 | <pre> | |
2506 | 1 RCU_NONIDLE({ | |
2507 | 2 do_something(); | |
2508 | 3 goto bad_idea; /* BUG!!! */ | |
2509 | 4 do_something_else();}); | |
2510 | 5 bad_idea: | |
2511 | </pre> | |
2512 | </blockquote> | |
2513 | ||
2514 | <p> | |
2515 | It is just as illegal to transfer control into the middle of | |
2516 | <tt>RCU_NONIDLE()</tt>'s argument. | |
2517 | Yes, in theory, you could transfer in as long as you also | |
2518 | transferred out, but in practice you could also expect to get sharply | |
2519 | worded review comments. | |
2520 | </ol> | |
649e4368 PM |
2521 | |
2522 | <p> | |
2523 | It is similarly socially unacceptable to interrupt an | |
2524 | <tt>nohz_full</tt> CPU running in userspace. | |
2525 | RCU must therefore track <tt>nohz_full</tt> userspace | |
2526 | execution. | |
fe5ac724 | 2527 | RCU must therefore be able to sample state at two points in |
649e4368 PM |
2528 | time, and be able to determine whether or not some other CPU spent |
2529 | any time idle and/or executing in userspace. | |
2530 | ||
2531 | <p> | |
2532 | These energy-efficiency requirements have proven quite difficult to | |
2533 | understand and to meet, for example, there have been more than five | |
2534 | clean-sheet rewrites of RCU's energy-efficiency code, the last of | |
2535 | which was finally able to demonstrate | |
2536 | <a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>. | |
2537 | As noted earlier, | |
2538 | I learned of many of these requirements via angry phone calls: | |
2539 | Flaming me on the Linux-kernel mailing list was apparently not | |
2540 | sufficient to fully vent their ire at RCU's energy-efficiency bugs! | |
2541 | ||
850bf6d5 PM |
2542 | <h3><a name="Scheduling-Clock Interrupts and RCU"> |
2543 | Scheduling-Clock Interrupts and RCU</a></h3> | |
2544 | ||
2545 | <p> | |
2546 | The kernel transitions between in-kernel non-idle execution, userspace | |
2547 | execution, and the idle loop. | |
2548 | Depending on kernel configuration, RCU handles these states differently: | |
2549 | ||
2550 | <table border=3> | |
2551 | <tr><th><tt>HZ</tt> Kconfig</th> | |
2552 | <th>In-Kernel</th> | |
2553 | <th>Usermode</th> | |
2554 | <th>Idle</th></tr> | |
2555 | <tr><th align="left"><tt>HZ_PERIODIC</tt></th> | |
2556 | <td>Can rely on scheduling-clock interrupt.</td> | |
2557 | <td>Can rely on scheduling-clock interrupt and its | |
2558 | detection of interrupt from usermode.</td> | |
2559 | <td>Can rely on RCU's dyntick-idle detection.</td></tr> | |
2560 | <tr><th align="left"><tt>NO_HZ_IDLE</tt></th> | |
2561 | <td>Can rely on scheduling-clock interrupt.</td> | |
2562 | <td>Can rely on scheduling-clock interrupt and its | |
2563 | detection of interrupt from usermode.</td> | |
2564 | <td>Can rely on RCU's dyntick-idle detection.</td></tr> | |
2565 | <tr><th align="left"><tt>NO_HZ_FULL</tt></th> | |
2566 | <td>Can only sometimes rely on scheduling-clock interrupt. | |
2567 | In other cases, it is necessary to bound kernel execution | |
2568 | times and/or use IPIs.</td> | |
2569 | <td>Can rely on RCU's dyntick-idle detection.</td> | |
2570 | <td>Can rely on RCU's dyntick-idle detection.</td></tr> | |
2571 | </table> | |
2572 | ||
2573 | <table> | |
2574 | <tr><th> </th></tr> | |
2575 | <tr><th align="left">Quick Quiz:</th></tr> | |
2576 | <tr><td> | |
2577 | Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the | |
2578 | scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt> | |
2579 | and <tt>NO_HZ_IDLE</tt> do? | |
2580 | </td></tr> | |
2581 | <tr><th align="left">Answer:</th></tr> | |
2582 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
2583 | Because, as a performance optimization, <tt>NO_HZ_FULL</tt> | |
2584 | does not necessarily re-enable the scheduling-clock interrupt | |
2585 | on entry to each and every system call. | |
2586 | </font></td></tr> | |
2587 | <tr><td> </td></tr> | |
2588 | </table> | |
2589 | ||
2590 | <p> | |
2591 | However, RCU must be reliably informed as to whether any given | |
2592 | CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>, | |
2593 | also whether that CPU is executing in usermode, as discussed | |
2594 | <a href="#Energy Efficiency">earlier</a>. | |
2595 | It also requires that the scheduling-clock interrupt be enabled when | |
2596 | RCU needs it to be: | |
2597 | ||
2598 | <ol> | |
2599 | <li> If a CPU is either idle or executing in usermode, and RCU believes | |
2600 | it is non-idle, the scheduling-clock tick had better be running. | |
2601 | Otherwise, you will get RCU CPU stall warnings. Or at best, | |
2602 | very long (11-second) grace periods, with a pointless IPI waking | |
2603 | the CPU from time to time. | |
2604 | <li> If a CPU is in a portion of the kernel that executes RCU read-side | |
2605 | critical sections, and RCU believes this CPU to be idle, you will get | |
2606 | random memory corruption. <b>DON'T DO THIS!!!</b> | |
2607 | ||
2608 | <br>This is one reason to test with lockdep, which will complain | |
2609 | about this sort of thing. | |
2610 | <li> If a CPU is in a portion of the kernel that is absolutely | |
2611 | positively no-joking guaranteed to never execute any RCU read-side | |
2612 | critical sections, and RCU believes this CPU to to be idle, | |
2613 | no problem. This sort of thing is used by some architectures | |
2614 | for light-weight exception handlers, which can then avoid the | |
2615 | overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> | |
2616 | at exception entry and exit, respectively. | |
2617 | Some go further and avoid the entireties of <tt>irq_enter()</tt> | |
2618 | and <tt>irq_exit()</tt>. | |
2619 | ||
2620 | <br>Just make very sure you are running some of your tests with | |
2621 | <tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths | |
2622 | was in fact joking about not doing RCU read-side critical sections. | |
2623 | <li> If a CPU is executing in the kernel with the scheduling-clock | |
2624 | interrupt disabled and RCU believes this CPU to be non-idle, | |
2625 | and if the CPU goes idle (from an RCU perspective) every few | |
2626 | jiffies, no problem. It is usually OK for there to be the | |
2627 | occasional gap between idle periods of up to a second or so. | |
2628 | ||
2629 | <br>If the gap grows too long, you get RCU CPU stall warnings. | |
2630 | <li> If a CPU is either idle or executing in usermode, and RCU believes | |
2631 | it to be idle, of course no problem. | |
2632 | <li> If a CPU is executing in the kernel, the kernel code | |
2633 | path is passing through quiescent states at a reasonable | |
2634 | frequency (preferably about once per few jiffies, but the | |
2635 | occasional excursion to a second or so is usually OK) and the | |
2636 | scheduling-clock interrupt is enabled, of course no problem. | |
2637 | ||
2638 | <br>If the gap between a successive pair of quiescent states grows | |
2639 | too long, you get RCU CPU stall warnings. | |
2640 | </ol> | |
2641 | ||
2642 | <table> | |
2643 | <tr><th> </th></tr> | |
2644 | <tr><th align="left">Quick Quiz:</th></tr> | |
2645 | <tr><td> | |
2646 | But what if my driver has a hardware interrupt handler | |
2647 | that can run for many seconds? | |
2648 | I cannot invoke <tt>schedule()</tt> from an hardware | |
2649 | interrupt handler, after all! | |
2650 | </td></tr> | |
2651 | <tr><th align="left">Answer:</th></tr> | |
2652 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
2653 | One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt> | |
2654 | every so often. | |
2655 | But given that long-running interrupt handlers can cause | |
2656 | other problems, not least for response time, shouldn't you | |
2657 | work to keep your interrupt handler's runtime within reasonable | |
2658 | bounds? | |
2659 | </font></td></tr> | |
2660 | <tr><td> </td></tr> | |
2661 | </table> | |
2662 | ||
2663 | <p> | |
2664 | But as long as RCU is properly informed of kernel state transitions between | |
2665 | in-kernel execution, usermode execution, and idle, and as long as the | |
2666 | scheduling-clock interrupt is enabled when RCU needs it to be, you | |
2667 | can rest assured that the bugs you encounter will be in some other | |
2668 | part of RCU or some other part of the kernel! | |
2669 | ||
701e8031 PM |
2670 | <h3><a name="Memory Efficiency">Memory Efficiency</a></h3> |
2671 | ||
2672 | <p> | |
2673 | Although small-memory non-realtime systems can simply use Tiny RCU, | |
2674 | code size is only one aspect of memory efficiency. | |
2675 | Another aspect is the size of the <tt>rcu_head</tt> structure | |
2676 | used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>. | |
2677 | Although this structure contains nothing more than a pair of pointers, | |
2678 | it does appear in many RCU-protected data structures, including | |
2679 | some that are size critical. | |
2680 | The <tt>page</tt> structure is a case in point, as evidenced by | |
2681 | the many occurrences of the <tt>union</tt> keyword within that structure. | |
2682 | ||
2683 | <p> | |
2684 | This need for memory efficiency is one reason that RCU uses hand-crafted | |
2685 | singly linked lists to track the <tt>rcu_head</tt> structures that | |
2686 | are waiting for a grace period to elapse. | |
2687 | It is also the reason why <tt>rcu_head</tt> structures do not contain | |
2688 | debug information, such as fields tracking the file and line of the | |
2689 | <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them. | |
2690 | Although this information might appear in debug-only kernel builds at some | |
2691 | point, in the meantime, the <tt>->func</tt> field will often provide | |
2692 | the needed debug information. | |
2693 | ||
2694 | <p> | |
2695 | However, in some cases, the need for memory efficiency leads to even | |
2696 | more extreme measures. | |
2697 | Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field | |
2698 | shares storage with a great many other structures that are used at | |
2699 | various points in the corresponding page's lifetime. | |
2700 | In order to correctly resolve certain | |
2701 | <a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>, | |
2702 | the Linux kernel's memory-management subsystem needs a particular bit | |
2703 | to remain zero during all phases of grace-period processing, | |
2704 | and that bit happens to map to the bottom bit of the | |
2705 | <tt>rcu_head</tt> structure's <tt>->next</tt> field. | |
2706 | RCU makes this guarantee as long as <tt>call_rcu()</tt> | |
2707 | is used to post the callback, as opposed to <tt>kfree_rcu()</tt> | |
2708 | or some future “lazy” | |
2709 | variant of <tt>call_rcu()</tt> that might one day be created for | |
2710 | energy-efficiency purposes. | |
2711 | ||
ed2bec07 PM |
2712 | <p> |
2713 | That said, there are limits. | |
2714 | RCU requires that the <tt>rcu_head</tt> structure be aligned to a | |
2715 | two-byte boundary, and passing a misaligned <tt>rcu_head</tt> | |
2716 | structure to one of the <tt>call_rcu()</tt> family of functions | |
2717 | will result in a splat. | |
2718 | It is therefore necessary to exercise caution when packing | |
2719 | structures containing fields of type <tt>rcu_head</tt>. | |
2720 | Why not a four-byte or even eight-byte alignment requirement? | |
2721 | Because the m68k architecture provides only two-byte alignment, | |
2722 | and thus acts as alignment's least common denominator. | |
2723 | ||
2724 | <p> | |
2725 | The reason for reserving the bottom bit of pointers to | |
2726 | <tt>rcu_head</tt> structures is to leave the door open to | |
2727 | “lazy” callbacks whose invocations can safely be deferred. | |
2728 | Deferring invocation could potentially have energy-efficiency | |
2729 | benefits, but only if the rate of non-lazy callbacks decreases | |
2730 | significantly for some important workload. | |
2731 | In the meantime, reserving the bottom bit keeps this option open | |
2732 | in case it one day becomes useful. | |
2733 | ||
649e4368 PM |
2734 | <h3><a name="Performance, Scalability, Response Time, and Reliability"> |
2735 | Performance, Scalability, Response Time, and Reliability</a></h3> | |
2736 | ||
2737 | <p> | |
2738 | Expanding on the | |
2739 | <a href="#Performance and Scalability">earlier discussion</a>, | |
2740 | RCU is used heavily by hot code paths in performance-critical | |
2741 | portions of the Linux kernel's networking, security, virtualization, | |
2742 | and scheduling code paths. | |
2743 | RCU must therefore use efficient implementations, especially in its | |
2744 | read-side primitives. | |
2745 | To that end, it would be good if preemptible RCU's implementation | |
2746 | of <tt>rcu_read_lock()</tt> could be inlined, however, doing | |
2747 | this requires resolving <tt>#include</tt> issues with the | |
2748 | <tt>task_struct</tt> structure. | |
2749 | ||
2750 | <p> | |
2751 | The Linux kernel supports hardware configurations with up to | |
2752 | 4096 CPUs, which means that RCU must be extremely scalable. | |
2753 | Algorithms that involve frequent acquisitions of global locks or | |
2754 | frequent atomic operations on global variables simply cannot be | |
2755 | tolerated within the RCU implementation. | |
2756 | RCU therefore makes heavy use of a combining tree based on the | |
2757 | <tt>rcu_node</tt> structure. | |
2758 | RCU is required to tolerate all CPUs continuously invoking any | |
2759 | combination of RCU's runtime primitives with minimal per-operation | |
2760 | overhead. | |
2761 | In fact, in many cases, increasing load must <i>decrease</i> the | |
2762 | per-operation overhead, witness the batching optimizations for | |
2763 | <tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>, | |
2764 | <tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>. | |
2765 | As a general rule, RCU must cheerfully accept whatever the | |
2766 | rest of the Linux kernel decides to throw at it. | |
2767 | ||
2768 | <p> | |
2769 | The Linux kernel is used for real-time workloads, especially | |
2770 | in conjunction with the | |
2771 | <a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>. | |
2772 | The real-time-latency response requirements are such that the | |
2773 | traditional approach of disabling preemption across RCU | |
2774 | read-side critical sections is inappropriate. | |
2775 | Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore | |
2776 | use an RCU implementation that allows RCU read-side critical | |
2777 | sections to be preempted. | |
2778 | This requirement made its presence known after users made it | |
2779 | clear that an earlier | |
2780 | <a href="https://lwn.net/Articles/107930/">real-time patch</a> | |
2781 | did not meet their needs, in conjunction with some | |
2782 | <a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a> | |
2783 | encountered by a very early version of the -rt patchset. | |
2784 | ||
2785 | <p> | |
2786 | In addition, RCU must make do with a sub-100-microsecond real-time latency | |
2787 | budget. | |
2788 | In fact, on smaller systems with the -rt patchset, the Linux kernel | |
2789 | provides sub-20-microsecond real-time latencies for the whole kernel, | |
2790 | including RCU. | |
2791 | RCU's scalability and latency must therefore be sufficient for | |
2792 | these sorts of configurations. | |
2793 | To my surprise, the sub-100-microsecond real-time latency budget | |
2794 | <a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf"> | |
2795 | applies to even the largest systems [PDF]</a>, | |
2796 | up to and including systems with 4096 CPUs. | |
2797 | This real-time requirement motivated the grace-period kthread, which | |
2798 | also simplified handling of a number of race conditions. | |
2799 | ||
41abcf32 PM |
2800 | <p> |
2801 | RCU must avoid degrading real-time response for CPU-bound threads, whether | |
2802 | executing in usermode (which is one use case for | |
2803 | <tt>CONFIG_NO_HZ_FULL=y</tt>) or in the kernel. | |
2804 | That said, CPU-bound loops in the kernel must execute | |
f2b1760a | 2805 | <tt>cond_resched()</tt> at least once per few tens of milliseconds |
41abcf32 PM |
2806 | in order to avoid receiving an IPI from RCU. |
2807 | ||
649e4368 PM |
2808 | <p> |
2809 | Finally, RCU's status as a synchronization primitive means that | |
2810 | any RCU failure can result in arbitrary memory corruption that can be | |
2811 | extremely difficult to debug. | |
2812 | This means that RCU must be extremely reliable, which in | |
2813 | practice also means that RCU must have an aggressive stress-test | |
2814 | suite. | |
2815 | This stress-test suite is called <tt>rcutorture</tt>. | |
2816 | ||
2817 | <p> | |
2818 | Although the need for <tt>rcutorture</tt> was no surprise, | |
2819 | the current immense popularity of the Linux kernel is posing | |
2820 | interesting—and perhaps unprecedented—validation | |
2821 | challenges. | |
2822 | To see this, keep in mind that there are well over one billion | |
2823 | instances of the Linux kernel running today, given Android | |
2824 | smartphones, Linux-powered televisions, and servers. | |
2825 | This number can be expected to increase sharply with the advent of | |
2826 | the celebrated Internet of Things. | |
2827 | ||
2828 | <p> | |
2829 | Suppose that RCU contains a race condition that manifests on average | |
2830 | once per million years of runtime. | |
2831 | This bug will be occurring about three times per <i>day</i> across | |
2832 | the installed base. | |
2833 | RCU could simply hide behind hardware error rates, given that no one | |
2834 | should really expect their smartphone to last for a million years. | |
2835 | However, anyone taking too much comfort from this thought should | |
2836 | consider the fact that in most jurisdictions, a successful multi-year | |
2837 | test of a given mechanism, which might include a Linux kernel, | |
2838 | suffices for a number of types of safety-critical certifications. | |
2839 | In fact, rumor has it that the Linux kernel is already being used | |
2840 | in production for safety-critical applications. | |
2841 | I don't know about you, but I would feel quite bad if a bug in RCU | |
2842 | killed someone. | |
2843 | Which might explain my recent focus on validation and verification. | |
2844 | ||
2845 | <h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2> | |
2846 | ||
2847 | <p> | |
2848 | One of the more surprising things about RCU is that there are now | |
2849 | no fewer than five <i>flavors</i>, or API families. | |
2850 | In addition, the primary flavor that has been the sole focus up to | |
2851 | this point has two different implementations, non-preemptible and | |
2852 | preemptible. | |
2853 | The other four flavors are listed below, with requirements for each | |
2854 | described in a separate section. | |
2855 | ||
2856 | <ol> | |
77095901 PM |
2857 | <li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a> |
2858 | <li> <a href="#Sched Flavor">Sched Flavor (Historical)</a> | |
649e4368 PM |
2859 | <li> <a href="#Sleepable RCU">Sleepable RCU</a> |
2860 | <li> <a href="#Tasks RCU">Tasks RCU</a> | |
2861 | </ol> | |
2862 | ||
77095901 PM |
2863 | <h3><a name="Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a></h3> |
2864 | ||
2865 | <p> | |
2866 | The RCU-bh flavor of RCU has since been expressed in terms of | |
2867 | the other RCU flavors as part of a consolidation of the three | |
2868 | flavors into a single flavor. | |
2869 | The read-side API remains, and continues to disable softirq and to | |
2870 | be accounted for by lockdep. | |
2871 | Much of the material in this section is therefore strictly historical | |
2872 | in nature. | |
649e4368 PM |
2873 | |
2874 | <p> | |
2875 | The softirq-disable (AKA “bottom-half”, | |
2876 | hence the “_bh” abbreviations) | |
2877 | flavor of RCU, or <i>RCU-bh</i>, was developed by | |
2878 | Dipankar Sarma to provide a flavor of RCU that could withstand the | |
2879 | network-based denial-of-service attacks researched by Robert | |
2880 | Olsson. | |
2881 | These attacks placed so much networking load on the system | |
2882 | that some of the CPUs never exited softirq execution, | |
2883 | which in turn prevented those CPUs from ever executing a context switch, | |
2884 | which, in the RCU implementation of that time, prevented grace periods | |
2885 | from ever ending. | |
2886 | The result was an out-of-memory condition and a system hang. | |
2887 | ||
2888 | <p> | |
2889 | The solution was the creation of RCU-bh, which does | |
2890 | <tt>local_bh_disable()</tt> | |
2891 | across its read-side critical sections, and which uses the transition | |
2892 | from one type of softirq processing to another as a quiescent state | |
2893 | in addition to context switch, idle, user mode, and offline. | |
2894 | This means that RCU-bh grace periods can complete even when some of | |
2895 | the CPUs execute in softirq indefinitely, thus allowing algorithms | |
2896 | based on RCU-bh to withstand network-based denial-of-service attacks. | |
2897 | ||
2898 | <p> | |
2899 | Because | |
2900 | <tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt> | |
2901 | disable and re-enable softirq handlers, any attempt to start a softirq | |
2902 | handlers during the | |
2903 | RCU-bh read-side critical section will be deferred. | |
2904 | In this case, <tt>rcu_read_unlock_bh()</tt> | |
2905 | will invoke softirq processing, which can take considerable time. | |
2906 | One can of course argue that this softirq overhead should be associated | |
2907 | with the code following the RCU-bh read-side critical section rather | |
2908 | than <tt>rcu_read_unlock_bh()</tt>, but the fact | |
2909 | is that most profiling tools cannot be expected to make this sort | |
2910 | of fine distinction. | |
2911 | For example, suppose that a three-millisecond-long RCU-bh read-side | |
2912 | critical section executes during a time of heavy networking load. | |
2913 | There will very likely be an attempt to invoke at least one softirq | |
2914 | handler during that three milliseconds, but any such invocation will | |
2915 | be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>. | |
2916 | This can of course make it appear at first glance as if | |
2917 | <tt>rcu_read_unlock_bh()</tt> was executing very slowly. | |
2918 | ||
2919 | <p> | |
2920 | The | |
2921 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a> | |
2922 | includes | |
2923 | <tt>rcu_read_lock_bh()</tt>, | |
2924 | <tt>rcu_read_unlock_bh()</tt>, | |
2925 | <tt>rcu_dereference_bh()</tt>, | |
2926 | <tt>rcu_dereference_bh_check()</tt>, | |
2927 | <tt>synchronize_rcu_bh()</tt>, | |
2928 | <tt>synchronize_rcu_bh_expedited()</tt>, | |
2929 | <tt>call_rcu_bh()</tt>, | |
2930 | <tt>rcu_barrier_bh()</tt>, and | |
2931 | <tt>rcu_read_lock_bh_held()</tt>. | |
77095901 PM |
2932 | However, the update-side APIs are now simple wrappers for other RCU |
2933 | flavors, namely RCU-sched in CONFIG_PREEMPT=n kernels and RCU-preempt | |
2934 | otherwise. | |
649e4368 | 2935 | |
77095901 PM |
2936 | <h3><a name="Sched Flavor">Sched Flavor (Historical)</a></h3> |
2937 | ||
2938 | <p> | |
2939 | The RCU-sched flavor of RCU has since been expressed in terms of | |
2940 | the other RCU flavors as part of a consolidation of the three | |
2941 | flavors into a single flavor. | |
2942 | The read-side API remains, and continues to disable preemption and to | |
2943 | be accounted for by lockdep. | |
2944 | Much of the material in this section is therefore strictly historical | |
2945 | in nature. | |
649e4368 PM |
2946 | |
2947 | <p> | |
2948 | Before preemptible RCU, waiting for an RCU grace period had the | |
2949 | side effect of also waiting for all pre-existing interrupt | |
2950 | and NMI handlers. | |
2951 | However, there are legitimate preemptible-RCU implementations that | |
2952 | do not have this property, given that any point in the code outside | |
2953 | of an RCU read-side critical section can be a quiescent state. | |
2954 | Therefore, <i>RCU-sched</i> was created, which follows “classic” | |
2955 | RCU in that an RCU-sched grace period waits for for pre-existing | |
2956 | interrupt and NMI handlers. | |
2957 | In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched | |
2958 | APIs have identical implementations, while kernels built with | |
2959 | <tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each. | |
2960 | ||
2961 | <p> | |
2962 | Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels, | |
2963 | <tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> | |
2964 | disable and re-enable preemption, respectively. | |
2965 | This means that if there was a preemption attempt during the | |
2966 | RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt> | |
2967 | will enter the scheduler, with all the latency and overhead entailed. | |
2968 | Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look | |
2969 | as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly. | |
2970 | However, the highest-priority task won't be preempted, so that task | |
2971 | will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations. | |
2972 | ||
2973 | <p> | |
2974 | The | |
2975 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a> | |
2976 | includes | |
2977 | <tt>rcu_read_lock_sched()</tt>, | |
2978 | <tt>rcu_read_unlock_sched()</tt>, | |
2979 | <tt>rcu_read_lock_sched_notrace()</tt>, | |
2980 | <tt>rcu_read_unlock_sched_notrace()</tt>, | |
2981 | <tt>rcu_dereference_sched()</tt>, | |
2982 | <tt>rcu_dereference_sched_check()</tt>, | |
2983 | <tt>synchronize_sched()</tt>, | |
2984 | <tt>synchronize_rcu_sched_expedited()</tt>, | |
2985 | <tt>call_rcu_sched()</tt>, | |
2986 | <tt>rcu_barrier_sched()</tt>, and | |
2987 | <tt>rcu_read_lock_sched_held()</tt>. | |
2988 | However, anything that disables preemption also marks an RCU-sched | |
2989 | read-side critical section, including | |
2990 | <tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>, | |
2991 | <tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>, | |
2992 | and so on. | |
2993 | ||
2994 | <h3><a name="Sleepable RCU">Sleepable RCU</a></h3> | |
2995 | ||
2996 | <p> | |
2997 | For well over a decade, someone saying “I need to block within | |
2998 | an RCU read-side critical section” was a reliable indication | |
2999 | that this someone did not understand RCU. | |
3000 | After all, if you are always blocking in an RCU read-side critical | |
3001 | section, you can probably afford to use a higher-overhead synchronization | |
3002 | mechanism. | |
3003 | However, that changed with the advent of the Linux kernel's notifiers, | |
3004 | whose RCU read-side critical | |
3005 | sections almost never sleep, but sometimes need to. | |
3006 | This resulted in the introduction of | |
3007 | <a href="https://lwn.net/Articles/202847/">sleepable RCU</a>, | |
3008 | or <i>SRCU</i>. | |
3009 | ||
3010 | <p> | |
3011 | SRCU allows different domains to be defined, with each such domain | |
3012 | defined by an instance of an <tt>srcu_struct</tt> structure. | |
3013 | A pointer to this structure must be passed in to each SRCU function, | |
3014 | for example, <tt>synchronize_srcu(&ss)</tt>, where | |
3015 | <tt>ss</tt> is the <tt>srcu_struct</tt> structure. | |
3016 | The key benefit of these domains is that a slow SRCU reader in one | |
3017 | domain does not delay an SRCU grace period in some other domain. | |
3018 | That said, one consequence of these domains is that read-side code | |
3019 | must pass a “cookie” from <tt>srcu_read_lock()</tt> | |
3020 | to <tt>srcu_read_unlock()</tt>, for example, as follows: | |
3021 | ||
3022 | <blockquote> | |
3023 | <pre> | |
3024 | 1 int idx; | |
3025 | 2 | |
3026 | 3 idx = srcu_read_lock(&ss); | |
3027 | 4 do_something(); | |
3028 | 5 srcu_read_unlock(&ss, idx); | |
3029 | </pre> | |
3030 | </blockquote> | |
3031 | ||
3032 | <p> | |
3033 | As noted above, it is legal to block within SRCU read-side critical sections, | |
3034 | however, with great power comes great responsibility. | |
3035 | If you block forever in one of a given domain's SRCU read-side critical | |
3036 | sections, then that domain's grace periods will also be blocked forever. | |
3037 | Of course, one good way to block forever is to deadlock, which can | |
3038 | happen if any operation in a given domain's SRCU read-side critical | |
3039 | section can block waiting, either directly or indirectly, for that domain's | |
3040 | grace period to elapse. | |
3041 | For example, this results in a self-deadlock: | |
3042 | ||
3043 | <blockquote> | |
3044 | <pre> | |
3045 | 1 int idx; | |
3046 | 2 | |
3047 | 3 idx = srcu_read_lock(&ss); | |
3048 | 4 do_something(); | |
3049 | 5 synchronize_srcu(&ss); | |
3050 | 6 srcu_read_unlock(&ss, idx); | |
3051 | </pre> | |
3052 | </blockquote> | |
3053 | ||
3054 | <p> | |
3055 | However, if line 5 acquired a mutex that was held across | |
3056 | a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>, | |
3057 | deadlock would still be possible. | |
3058 | Furthermore, if line 5 acquired a mutex that was held across | |
3059 | a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>, | |
3060 | and if an <tt>ss1</tt>-domain SRCU read-side critical section | |
3061 | acquired another mutex that was held across as <tt>ss</tt>-domain | |
3062 | <tt>synchronize_srcu()</tt>, | |
3063 | deadlock would again be possible. | |
3064 | Such a deadlock cycle could extend across an arbitrarily large number | |
3065 | of different SRCU domains. | |
3066 | Again, with great power comes great responsibility. | |
3067 | ||
3068 | <p> | |
3069 | Unlike the other RCU flavors, SRCU read-side critical sections can | |
3070 | run on idle and even offline CPUs. | |
3071 | This ability requires that <tt>srcu_read_lock()</tt> and | |
3072 | <tt>srcu_read_unlock()</tt> contain memory barriers, which means | |
3073 | that SRCU readers will run a bit slower than would RCU readers. | |
3074 | It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt> | |
3075 | API, which, in combination with <tt>srcu_read_unlock()</tt>, | |
3076 | guarantees a full memory barrier. | |
3077 | ||
6771853b PM |
3078 | <p> |
3079 | Also unlike other RCU flavors, SRCU's callbacks-wait function | |
3080 | <tt>srcu_barrier()</tt> may be invoked from CPU-hotplug notifiers, | |
3081 | though this is not necessarily a good idea. | |
3082 | The reason that this is possible is that SRCU is insensitive | |
3083 | to whether or not a CPU is online, which means that <tt>srcu_barrier()</tt> | |
3084 | need not exclude CPU-hotplug operations. | |
3085 | ||
09f501a0 PM |
3086 | <p> |
3087 | SRCU also differs from other RCU flavors in that SRCU's expedited and | |
3088 | non-expedited grace periods are implemented by the same mechanism. | |
3089 | This means that in the current SRCU implementation, expediting a | |
3090 | future grace period has the side effect of expediting all prior | |
3091 | grace periods that have not yet completed. | |
3092 | (But please note that this is a property of the current implementation, | |
3093 | not necessarily of future implementations.) | |
3094 | In addition, if SRCU has been idle for longer than the interval | |
3095 | specified by the <tt>srcutree.exp_holdoff</tt> kernel boot parameter | |
3096 | (25 microseconds by default), | |
3097 | and if a <tt>synchronize_srcu()</tt> invocation ends this idle period, | |
3098 | that invocation will be automatically expedited. | |
3099 | ||
6771853b PM |
3100 | <p> |
3101 | As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating | |
3102 | a locking bottleneck present in prior kernel versions. | |
3103 | Although this will allow users to put much heavier stress on | |
3104 | <tt>call_srcu()</tt>, it is important to note that SRCU does not | |
3105 | yet take any special steps to deal with callback flooding. | |
3106 | So if you are posting (say) 10,000 SRCU callbacks per second per CPU, | |
3107 | you are probably totally OK, but if you intend to post (say) 1,000,000 | |
3108 | SRCU callbacks per second per CPU, please run some tests first. | |
3109 | SRCU just might need a few adjustment to deal with that sort of load. | |
3110 | Of course, your mileage may vary based on the speed of your CPUs and | |
3111 | the size of your memory. | |
3112 | ||
649e4368 PM |
3113 | <p> |
3114 | The | |
3115 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a> | |
3116 | includes | |
3117 | <tt>srcu_read_lock()</tt>, | |
3118 | <tt>srcu_read_unlock()</tt>, | |
3119 | <tt>srcu_dereference()</tt>, | |
3120 | <tt>srcu_dereference_check()</tt>, | |
3121 | <tt>synchronize_srcu()</tt>, | |
3122 | <tt>synchronize_srcu_expedited()</tt>, | |
3123 | <tt>call_srcu()</tt>, | |
3124 | <tt>srcu_barrier()</tt>, and | |
3125 | <tt>srcu_read_lock_held()</tt>. | |
3126 | It also includes | |
3127 | <tt>DEFINE_SRCU()</tt>, | |
3128 | <tt>DEFINE_STATIC_SRCU()</tt>, and | |
3129 | <tt>init_srcu_struct()</tt> | |
3130 | APIs for defining and initializing <tt>srcu_struct</tt> structures. | |
3131 | ||
3132 | <h3><a name="Tasks RCU">Tasks RCU</a></h3> | |
3133 | ||
3134 | <p> | |
526914a0 | 3135 | Some forms of tracing use “trampolines” to handle the |
649e4368 PM |
3136 | binary rewriting required to install different types of probes. |
3137 | It would be good to be able to free old trampolines, which sounds | |
3138 | like a job for some form of RCU. | |
3139 | However, because it is necessary to be able to install a trace | |
3140 | anywhere in the code, it is not possible to use read-side markers | |
3141 | such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. | |
3142 | In addition, it does not work to have these markers in the trampoline | |
3143 | itself, because there would need to be instructions following | |
3144 | <tt>rcu_read_unlock()</tt>. | |
3145 | Although <tt>synchronize_rcu()</tt> would guarantee that execution | |
3146 | reached the <tt>rcu_read_unlock()</tt>, it would not be able to | |
3147 | guarantee that execution had completely left the trampoline. | |
3148 | ||
3149 | <p> | |
3150 | The solution, in the form of | |
3151 | <a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>, | |
3152 | is to have implicit | |
3153 | read-side critical sections that are delimited by voluntary context | |
3154 | switches, that is, calls to <tt>schedule()</tt>, | |
f2b1760a | 3155 | <tt>cond_resched()</tt>, and |
649e4368 PM |
3156 | <tt>synchronize_rcu_tasks()</tt>. |
3157 | In addition, transitions to and from userspace execution also delimit | |
3158 | tasks-RCU read-side critical sections. | |
3159 | ||
3160 | <p> | |
3161 | The tasks-RCU API is quite compact, consisting only of | |
3162 | <tt>call_rcu_tasks()</tt>, | |
3163 | <tt>synchronize_rcu_tasks()</tt>, and | |
3164 | <tt>rcu_barrier_tasks()</tt>. | |
77095901 PM |
3165 | In <tt>CONFIG_PREEMPT=n</tt> kernels, trampolines cannot be preempted, |
3166 | so these APIs map to | |
3167 | <tt>call_rcu()</tt>, | |
3168 | <tt>synchronize_rcu()</tt>, and | |
3169 | <tt>rcu_barrier()</tt>, respectively. | |
3170 | In <tt>CONFIG_PREEMPT=y</tt> kernels, trampolines can be preempted, | |
3171 | and these three APIs are therefore implemented by separate functions | |
3172 | that check for voluntary context switches. | |
f43b6254 | 3173 | |
649e4368 PM |
3174 | <h2><a name="Possible Future Changes">Possible Future Changes</a></h2> |
3175 | ||
3176 | <p> | |
3177 | One of the tricks that RCU uses to attain update-side scalability is | |
3178 | to increase grace-period latency with increasing numbers of CPUs. | |
3179 | If this becomes a serious problem, it will be necessary to rework the | |
3180 | grace-period state machine so as to avoid the need for the additional | |
3181 | latency. | |
3182 | ||
649e4368 PM |
3183 | <p> |
3184 | RCU disables CPU hotplug in a few places, perhaps most notably in the | |
6771853b PM |
3185 | <tt>rcu_barrier()</tt> operations. |
3186 | If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug | |
649e4368 PM |
3187 | notifiers, it will be necessary to avoid disabling CPU hotplug. |
3188 | This would introduce some complexity, so there had better be a <i>very</i> | |
3189 | good reason. | |
3190 | ||
3191 | <p> | |
3192 | The tradeoff between grace-period latency on the one hand and interruptions | |
3193 | of other CPUs on the other hand may need to be re-examined. | |
3194 | The desire is of course for zero grace-period latency as well as zero | |
3195 | interprocessor interrupts undertaken during an expedited grace period | |
3196 | operation. | |
3197 | While this ideal is unlikely to be achievable, it is quite possible that | |
3198 | further improvements can be made. | |
3199 | ||
3200 | <p> | |
3201 | The multiprocessor implementations of RCU use a combining tree that | |
3202 | groups CPUs so as to reduce lock contention and increase cache locality. | |
3203 | However, this combining tree does not spread its memory across NUMA | |
3204 | nodes nor does it align the CPU groups with hardware features such | |
3205 | as sockets or cores. | |
3206 | Such spreading and alignment is currently believed to be unnecessary | |
3207 | because the hotpath read-side primitives do not access the combining | |
3208 | tree, nor does <tt>call_rcu()</tt> in the common case. | |
3209 | If you believe that your architecture needs such spreading and alignment, | |
3210 | then your architecture should also benefit from the | |
3211 | <tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set | |
3212 | to the number of CPUs in a socket, NUMA node, or whatever. | |
3213 | If the number of CPUs is too large, use a fraction of the number of | |
3214 | CPUs. | |
3215 | If the number of CPUs is a large prime number, well, that certainly | |
3216 | is an “interesting” architectural choice! | |
3217 | More flexible arrangements might be considered, but only if | |
3218 | <tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only | |
3219 | if the inadequacy has been demonstrated by a carefully run and | |
3220 | realistic system-level workload. | |
3221 | ||
3222 | <p> | |
3223 | Please note that arrangements that require RCU to remap CPU numbers will | |
3224 | require extremely good demonstration of need and full exploration of | |
3225 | alternatives. | |
3226 | ||
649e4368 PM |
3227 | <p> |
3228 | RCU's various kthreads are reasonably recent additions. | |
3229 | It is quite likely that adjustments will be required to more gracefully | |
3230 | handle extreme loads. | |
3231 | It might also be necessary to be able to relate CPU utilization by | |
3232 | RCU's kthreads and softirq handlers to the code that instigated this | |
3233 | CPU utilization. | |
3234 | For example, RCU callback overhead might be charged back to the | |
3235 | originating <tt>call_rcu()</tt> instance, though probably not | |
3236 | in production kernels. | |
3237 | ||
3238 | <h2><a name="Summary">Summary</a></h2> | |
3239 | ||
3240 | <p> | |
3241 | This document has presented more than two decade's worth of RCU | |
3242 | requirements. | |
3243 | Given that the requirements keep changing, this will not be the last | |
3244 | word on this subject, but at least it serves to get an important | |
3245 | subset of the requirements set forth. | |
3246 | ||
3247 | <h2><a name="Acknowledgments">Acknowledgments</a></h2> | |
3248 | ||
3249 | I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, | |
3250 | Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and | |
3251 | Andy Lutomirski for their help in rendering | |
3252 | this article human readable, and to Michelle Rankin for her support | |
3253 | of this effort. | |
3254 | Other contributions are acknowledged in the Linux kernel's git archive. | |
649e4368 | 3255 | |
649e4368 | 3256 | </body></html> |