This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances. Many of these recipes include example code from
v4.13 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or where only one memory location is accessed, and
the second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order. There are (as
usual) some things to be careful of:

1.      Some aspects of the C language are unordered. For example,
        in the expression "f(x) + g(y)", the order in which f and g are
        called is not defined; the object code is allowed to use either
        order or even to interleave the computations.

2.      Compilers are permitted to use the "as-if" rule. That is, a
        compiler can emit whatever code it likes for normal accesses,
        as long as the results of a single-threaded execution appear
        just as if the compiler had followed all the relevant rules.
        To see this, compile with a high level of optimization and run
        the debugger on the resulting binary.

3.      If there is only one variable but multiple CPUs, that variable
        must be properly aligned and all accesses to that variable must
        be full sized. Variables that straddle cachelines or pages void
        your full-ordering warranty, as do undersized accesses that load
        from or store to only part of the variable.

4.      If there are multiple CPUs, accesses to shared variables should
        use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
        tearing, load/store fusing, and invented loads and stores.
        There are exceptions to this rule, including the following
        (a sketch of the first exception appears after this list):

        i.      When there is no possibility of a given shared variable
                being updated by some other CPU, for example, while
                holding the update-side lock, reads from that variable
                need not use READ_ONCE().

        ii.     When there is no possibility of a given shared variable
                being either read or updated by other CPUs, for example,
                when running during early boot, reads from that variable
                need not use READ_ONCE() and writes to that variable
                need not use WRITE_ONCE().

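For example, here is a sketch (not taken from any kernel source file) of
the first exception: the hypothetical function update_counter() holds
mylock across all updates of "counter", so its read of that variable can
be a plain C-language read, while the lockless read_counter() must still
use READ_ONCE():

        DEFINE_SPINLOCK(mylock);
        unsigned long counter;

        void update_counter(void)
        {
                spin_lock(&mylock);
                /*
                 * No other CPU can update counter while mylock is held,
                 * so the plain read of counter below is safe. The update
                 * still uses WRITE_ONCE() because of the lockless readers.
                 */
                WRITE_ONCE(counter, counter + 1);
                spin_unlock(&mylock);
        }

        unsigned long read_counter(void)
        {
                return READ_ONCE(counter);
        }
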

Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard. And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock. Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock". For example,
consider the following pair of code fragments:

        /* See MP+polocks.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
                r1 = READ_ONCE(x);
        }

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1. This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1. In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

        /* See MP+porevlocks.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(y);
                spin_lock(&mylock);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
                WRITE_ONCE(y, 1);
        }

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0. In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.
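
For example, consider the following three-CPU sketch, which is not one
of the kernel's litmus tests, but which follows the same conventions:

        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(x);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                spin_lock(&mylock);
                r1 = READ_ONCE(y);
                r2 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

If CPU0() acquires mylock first, CPU1() second, and CPU2() third, then
the basic rule guarantees that r0, r1, and r2 all end up equal to 1.
As before, this also has the consequence that if the final values of r0
and r1 are both equal to 1, then the final value of r2 must also be
equal to 1.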

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

        /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0. The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

        /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                smp_mb__after_spinlock();
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order. The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test. This section therefore looks at
a number of ways of meeting this goal.
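
For reference, here is a sketch of that unordered version, in which the
counter-intuitive outcome (final value of r0 equal to 1 and final value
of r1 equal to 0) really can happen:

        /* Compare MP+poonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                r1 = READ_ONCE(x);
        }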


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering. The general approach is shown below:

        /* See MP+pooncerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                r1 = READ_ONCE(x);
        }

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses. Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack depot. Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers. The general approach is shown below:

        /* See MP+onceassign+derefonce.litmus. */
        int z;
        int *y = &z;
        int x;

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                rcu_assign_pointer(y, &x);
        }

        void CPU1(void)
        {
                rcu_read_lock();
                r0 = rcu_dereference(y);
                r1 = READ_ONCE(*r0);
                rcu_read_unlock();
        }

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded. A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency). Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.

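The following sketch, which is not taken from the kernel and which uses
hypothetical variables, shows all three kinds of dependency headed by the
pointer returned from rcu_dereference():

        struct foo {
                int a;
                int b;
        };
        struct foo default_foo;
        struct foo *gp = &default_foo;
        struct foo *gq;
        int out;

        void CPU1(void)
        {
                struct foo *p;
                int r0;

                rcu_read_lock();
                p = rcu_dereference(gp);
                r0 = READ_ONCE(p->a);   /* Address dependency: p selects the address loaded. */
                WRITE_ONCE(p->b, r0);   /* Address dependency: p selects the address stored. */
                WRITE_ONCE(gq, p);      /* Data dependency: p is the value stored. */
                if (p != &default_foo)  /* Control dependency: p determines whether... */
                        WRITE_ONCE(out, 1);     /* ...this store executes at all. */
                rcu_read_unlock();
        }
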
In lib/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference(). This combination mediates access to a bit vector
that is expanded as additional primes are needed.


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb(). However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases. The general approach is shown below:

        /* See MP+wmbonceonce+rmbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_rmb();
                r1 = READ_ONCE(x);
        }

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads. Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

        log->l_curr_block -= log->l_logBBsize;
        ASSERT(log->l_curr_block >= 0);
        smp_wmb();
        log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

        cur_cycle = READ_ONCE(log->l_curr_cycle);
        smp_rmb();
        cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

         *   kernel                              user
         *
         *   if (LOAD ->data_tail) {             LOAD ->data_head
         *                   (A)                 smp_rmb()       (C)
         *      STORE $data                      LOAD $data
         *      smp_wmb()    (B)                 smp_mb()        (D)
         *      STORE ->data_head                STORE ->data_tail
         *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.
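
For example, the MP fragment shown at the start of this section would
still forbid the counter-intuitive outcome if smp_mb() were substituted
for both of the weaker barriers:

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();       /* In place of smp_wmb(). */
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_mb();       /* In place of smp_rmb(). */
                r1 = READ_ONCE(x);
        }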


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first. The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store. In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.
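
Here is a sketch of that unordered version, in which both r0 and r1
really can end up equal to 1:

        /* Compare LB+poonceonces.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                WRITE_ONCE(x, 1);
        }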

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

        /* See LB+ctrlonceonce+mbonceonce.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                if (r0)
                        WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                smp_mb();
                WRITE_ONCE(x, 1);
        }

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB. Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

         *   kernel                              user
         *
         *   if (LOAD ->data_tail) {             LOAD ->data_head
         *                   (A)                 smp_rmb()       (C)
         *      STORE $data                      LOAD $data
         *      smp_wmb()    (B)                 smp_mb()        (D)
         *      STORE ->data_head                STORE ->data_tail
         *   }
         *
         * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order. However, they do have some limitations that
need to be fully understood. Here is an example that maintains order:

        /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                r1 = smp_load_acquire(&z);
                r2 = READ_ONCE(x);
        }

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be. For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().
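
In other words, the following weakened CPU1(), with CPU0() and CPU2()
unchanged, still guarantees the outcome described above, because the
smp_store_release() orders the prior load from y against the store to z.
(This variant is a sketch and does not correspond to a litmus test named
in this document.)

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_store_release(&z, 1);
        }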

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

        /* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0. The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.

Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first. Preserving order requires nothing less than full barriers:

        /* See SB+mbonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        void CPU1(void)
        {
                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups. The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

         *      CPU0 - waker                    CPU1 - waiter
         *
         *                                      for (;;) {
         *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
         *      smp_mb();                         // smp_mb() from set_current_state()
         *      if (waitqueue_active(wq_head))    if (@cond)
         *        wake_up(wq_head);                 break;
         *                                        schedule();
         *                                      }
         *                                      finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)". The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case. There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test. There are three types of linkage:

1.      Write-to-read, where the next CPU reads the value that the
        previous CPU wrote. The LB litmus-test patterns contain only
        this type of relation. In formal memory-modeling texts, this
        relation is called "reads-from" and is usually abbreviated "rf".

2.      Read-to-write, where the next CPU overwrites the value that the
        previous CPU read. The SB litmus test contains only this type
        of relation. In formal memory-modeling texts, this relation is
        often called "from-reads" and is sometimes abbreviated "fr".

3.      Write-to-write, where the next CPU overwrites the value written
        by the previous CPU. The Z6.0 litmus test pattern contains a
        write-to-write relation between the last access of CPU1() and
        the first access of CPU2(). In formal memory-modeling texts,
        this relation is often called "coherence order" and is sometimes
        abbreviated "co". In the C++ standard, it is instead called
        "modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o       If all links are write-to-read links, then the weakest
        possible ordering within each CPU suffices. For example, in
        the LB litmus test, a control dependency was enough to do the
        job.

o       If all but one of the links are write-to-read links, then a
        release-acquire chain suffices. Both the MP and the ISA2
        litmus tests illustrate this case.

o       If more than one of the links are something other than
        write-to-read links, then a full memory barrier is required
        between each successive pair of non-write-to-read links. This
        case is illustrated by the Z6.0 litmus tests, both in the
        locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.
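
For example, here is a sketch of a litmus test in the style of the ones
named throughout this document, patterned after the release-acquire MP
recipe; the "exists" clause asks whether the counter-intuitive outcome
can occur, and as discussed earlier that outcome is forbidden here:

        C MP+pooncerelease+poacquireonce

        {}

        P0(int *x, int *y)
        {
                WRITE_ONCE(*x, 1);
                smp_store_release(y, 1);
        }

        P1(int *x, int *y)
        {
                int r0;
                int r1;

                r0 = smp_load_acquire(y);
                r1 = READ_ONCE(*x);
        }

        exists (1:r0=1 /\ 1:r1=0)

Assuming that the herd7 tool and the Linux-kernel memory model are set up
as described in tools/memory-model/README, such a test can be run with a
command along the lines of:

        herd7 -conf linux-kernel.cfg MP+pooncerelease+poacquireonce.litmus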