1 ============================
2 LINUX KERNEL MEMORY BARRIERS
3 ============================
4
5By: David Howells <dhowells@redhat.com>
 6    Paul E. McKenney <paulmck@linux.vnet.ibm.com>
7
8Contents:
9
10 (*) Abstract memory access model.
11
12 - Device operations.
13 - Guarantees.
14
15 (*) What are memory barriers?
16
17 - Varieties of memory barrier.
18 - What may not be assumed about memory barriers?
19 - Data dependency barriers.
20 - Control dependencies.
21 - SMP barrier pairing.
22 - Examples of memory barrier sequences.
 23     - Read memory barriers vs load speculation.
 24     - Transitivity
25
26 (*) Explicit kernel barriers.
27
28 - Compiler barrier.
 29     - CPU memory barriers.
30 - MMIO write barrier.
31
32 (*) Implicit kernel memory barriers.
33
34 - Locking functions.
35 - Interrupt disabling functions.
 36     - Sleep and wake-up functions.
37 - Miscellaneous functions.
38
39 (*) Inter-CPU locking barrier effects.
40
41 - Locks vs memory accesses.
42 - Locks vs I/O accesses.
43
44 (*) Where are memory barriers needed?
45
46 - Interprocessor interaction.
47 - Atomic operations.
48 - Accessing devices.
49 - Interrupts.
50
51 (*) Kernel I/O barrier effects.
52
53 (*) Assumed minimum execution ordering model.
54
55 (*) The effects of the cpu cache.
56
57 - Cache coherency.
58 - Cache coherency vs DMA.
59 - Cache coherency vs MMIO.
60
61 (*) The things CPUs get up to.
62
63 - And then there's the Alpha.
64
65 (*) Example uses.
66
67 - Circular buffers.
68
69 (*) References.
70
71
72============================
73ABSTRACT MEMORY ACCESS MODEL
74============================
75
76Consider the following abstract model of the system:
77
78 : :
79 : :
80 : :
81 +-------+ : +--------+ : +-------+
82 | | : | | : | |
83 | | : | | : | |
84 | CPU 1 |<----->| Memory |<----->| CPU 2 |
85 | | : | | : | |
86 | | : | | : | |
87 +-------+ : +--------+ : +-------+
88 ^ : ^ : ^
89 | : | : |
90 | : | : |
91 | : v : |
92 | : +--------+ : |
93 | : | | : |
94 | : | | : |
95 +---------->| Device |<----------+
96 : | | :
97 : | | :
98 : +--------+ :
99 : :
100
101Each CPU executes a program that generates memory access operations. In the
102abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
103perform the memory operations in any order it likes, provided program causality
104appears to be maintained. Similarly, the compiler may also arrange the
105instructions it emits in any order it likes, provided it doesn't affect the
106apparent operation of the program.
107
108So in the above diagram, the effects of the memory operations performed by a
109CPU are perceived by the rest of the system as the operations cross the
110interface between the CPU and rest of the system (the dotted lines).
111
112
113For example, consider the following sequence of events:
114
115 CPU 1 CPU 2
116 =============== ===============
117 { A == 1; B == 2 }
118 A = 3; x = A;
119 B = 4; y = B;
120
121The set of accesses as seen by the memory system in the middle can be arranged
122in 24 different combinations:
123
124 STORE A=3, STORE B=4, x=LOAD A->3, y=LOAD B->4
125 STORE A=3, STORE B=4, y=LOAD B->4, x=LOAD A->3
126 STORE A=3, x=LOAD A->3, STORE B=4, y=LOAD B->4
127 STORE A=3, x=LOAD A->3, y=LOAD B->2, STORE B=4
128 STORE A=3, y=LOAD B->2, STORE B=4, x=LOAD A->3
129 STORE A=3, y=LOAD B->2, x=LOAD A->3, STORE B=4
130 STORE B=4, STORE A=3, x=LOAD A->3, y=LOAD B->4
131 STORE B=4, ...
132 ...
133
134and can thus result in four different combinations of values:
135
136 x == 1, y == 2
137 x == 1, y == 4
138 x == 3, y == 2
139 x == 3, y == 4
140
141
142Furthermore, the stores committed by a CPU to the memory system may not be
143perceived by the loads made by another CPU in the same order as the stores were
144committed.
145
146
147As a further example, consider this sequence of events:
148
149 CPU 1 CPU 2
150 =============== ===============
151 { A == 1, B == 2, C = 3, P == &A, Q == &C }
152 B = 4; Q = P;
153 P = &B D = *Q;
154
155There is an obvious data dependency here, as the value loaded into D depends on
156the address retrieved from P by CPU 2. At the end of the sequence, any of the
157following results are possible:
158
159 (Q == &A) and (D == 1)
160 (Q == &B) and (D == 2)
161 (Q == &B) and (D == 4)
162
163Note that CPU 2 will never try and load C into D because the CPU will load P
164into Q before issuing the load of *Q.
165
166
167DEVICE OPERATIONS
168-----------------
169
170Some devices present their control interfaces as collections of memory
171locations, but the order in which the control registers are accessed is very
172important. For instance, imagine an ethernet card with a set of internal
173registers that are accessed through an address port register (A) and a data
174port register (D). To read internal register 5, the following code might then
175be used:
176
177 *A = 5;
178 x = *D;
179
180but this might show up as either of the following two sequences:
181
182 STORE *A = 5, x = LOAD *D
183 x = LOAD *D, STORE *A = 5
184
 185the second of which will almost certainly result in a malfunction, since it sets
186the address _after_ attempting to read the register.
187
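One way such an indirect register read might be coded in a driver is sketched
below (assumptions: the __iomem pointers, the 32-bit register width and the use
of a mandatory barrier between the two accesses are illustrative, not a
prescription):

	static u32 read_indirect_reg(void __iomem *addr_port,
				     void __iomem *data_port, u32 reg)
	{
		writel(reg, addr_port);		/* select internal register 'reg' */
		mb();				/* keep the select ahead of the read */
		return readl(data_port);	/* read the selected register */
	}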
188
189GUARANTEES
190----------
191
192There are some minimal guarantees that may be expected of a CPU:
193
194 (*) On any given CPU, dependent memory accesses will be issued in order, with
195 respect to itself. This means that for:
196
 197 	ACCESS_ONCE(Q) = P; smp_read_barrier_depends(); D = ACCESS_ONCE(*Q);
198
199 the CPU will issue the following memory operations:
200
201 Q = LOAD P, D = LOAD *Q
202
203 and always in that order. On most systems, smp_read_barrier_depends()
204 does nothing, but it is required for DEC Alpha. The ACCESS_ONCE()
205 is required to prevent compiler mischief. Please note that you
206 should normally use something like rcu_dereference() instead of
207 open-coding smp_read_barrier_depends().
208
209 (*) Overlapping loads and stores within a particular CPU will appear to be
210 ordered within that CPU. This means that for:
211
 212 	a = ACCESS_ONCE(*X); ACCESS_ONCE(*X) = b;
213
214 the CPU will only issue the following sequence of memory operations:
215
216 a = LOAD *X, STORE *X = b
217
218 And for:
219
 220 	ACCESS_ONCE(*X) = c; d = ACCESS_ONCE(*X);
221
222 the CPU will only issue:
223
224 STORE *X = c, d = LOAD *X
225
 226 (Loads and stores overlap if they are targeted at overlapping pieces of
227 memory).
228
229And there are a number of things that _must_ or _must_not_ be assumed:
230
231 (*) It _must_not_ be assumed that the compiler will do what you want with
232 memory references that are not protected by ACCESS_ONCE(). Without
233 ACCESS_ONCE(), the compiler is within its rights to do all sorts
234 of "creative" transformations, which are covered in the Compiler
235 Barrier section.
 236
237 (*) It _must_not_ be assumed that independent loads and stores will be issued
238 in the order given. This means that for:
239
240 X = *A; Y = *B; *D = Z;
241
242 we may get any of the following sequences:
243
244 X = LOAD *A, Y = LOAD *B, STORE *D = Z
245 X = LOAD *A, STORE *D = Z, Y = LOAD *B
246 Y = LOAD *B, X = LOAD *A, STORE *D = Z
247 Y = LOAD *B, STORE *D = Z, X = LOAD *A
248 STORE *D = Z, X = LOAD *A, Y = LOAD *B
249 STORE *D = Z, Y = LOAD *B, X = LOAD *A
250
251 (*) It _must_ be assumed that overlapping memory accesses may be merged or
252 discarded. This means that for:
253
254 X = *A; Y = *(A + 4);
255
256 we may get any one of the following sequences:
257
258 X = LOAD *A; Y = LOAD *(A + 4);
259 Y = LOAD *(A + 4); X = LOAD *A;
260 {X, Y} = LOAD {*A, *(A + 4) };
261
262 And for:
263
 264 	*A = X; *(A + 4) = Y;
 265
 266 we may get any of:
 267
268 STORE *A = X; STORE *(A + 4) = Y;
269 STORE *(A + 4) = Y; STORE *A = X;
270 STORE {*A, *(A + 4) } = {X, Y};
271
272
273=========================
274WHAT ARE MEMORY BARRIERS?
275=========================
276
277As can be seen above, independent memory operations are effectively performed
278in random order, but this can be a problem for CPU-CPU interaction and for I/O.
279What is required is some way of intervening to instruct the compiler and the
280CPU to restrict the order.
281
282Memory barriers are such interventions. They impose a perceived partial
283ordering over the memory operations on either side of the barrier.
284
285Such enforcement is important because the CPUs and other devices in a system
 286 can use a variety of tricks to improve performance, including reordering,
287deferral and combination of memory operations; speculative loads; speculative
288branch prediction and various types of caching. Memory barriers are used to
289override or suppress these tricks, allowing the code to sanely control the
290interaction of multiple CPUs and/or devices.
291
292
293VARIETIES OF MEMORY BARRIER
294---------------------------
295
296Memory barriers come in four basic varieties:
297
298 (1) Write (or store) memory barriers.
299
300 A write memory barrier gives a guarantee that all the STORE operations
301 specified before the barrier will appear to happen before all the STORE
302 operations specified after the barrier with respect to the other
303 components of the system.
304
305 A write barrier is a partial ordering on stores only; it is not required
306 to have any effect on loads.
307
 308 A CPU can be viewed as committing a sequence of store operations to the
309 memory system as time progresses. All stores before a write barrier will
310 occur in the sequence _before_ all the stores after the write barrier.
311
312 [!] Note that write barriers should normally be paired with read or data
313 dependency barriers; see the "SMP barrier pairing" subsection.
314
315
316 (2) Data dependency barriers.
317
318 A data dependency barrier is a weaker form of read barrier. In the case
319 where two loads are performed such that the second depends on the result
320 of the first (eg: the first load retrieves the address to which the second
321 load will be directed), a data dependency barrier would be required to
322 make sure that the target of the second load is updated before the address
323 obtained by the first load is accessed.
324
325 A data dependency barrier is a partial ordering on interdependent loads
326 only; it is not required to have any effect on stores, independent loads
327 or overlapping loads.
328
329 As mentioned in (1), the other CPUs in the system can be viewed as
330 committing sequences of stores to the memory system that the CPU being
331 considered can then perceive. A data dependency barrier issued by the CPU
332 under consideration guarantees that for any load preceding it, if that
333 load touches one of a sequence of stores from another CPU, then by the
334 time the barrier completes, the effects of all the stores prior to that
335 touched by the load will be perceptible to any loads issued after the data
336 dependency barrier.
337
338 See the "Examples of memory barrier sequences" subsection for diagrams
339 showing the ordering constraints.
340
341 [!] Note that the first load really has to have a _data_ dependency and
342 not a control dependency. If the address for the second load is dependent
343 on the first load, but the dependency is through a conditional rather than
344 actually loading the address itself, then it's a _control_ dependency and
345 a full read barrier or better is required. See the "Control dependencies"
346 subsection for more information.
347
348 [!] Note that data dependency barriers should normally be paired with
349 write barriers; see the "SMP barrier pairing" subsection.
350
351
352 (3) Read (or load) memory barriers.
353
354 A read barrier is a data dependency barrier plus a guarantee that all the
355 LOAD operations specified before the barrier will appear to happen before
356 all the LOAD operations specified after the barrier with respect to the
357 other components of the system.
358
359 A read barrier is a partial ordering on loads only; it is not required to
360 have any effect on stores.
361
362 Read memory barriers imply data dependency barriers, and so can substitute
363 for them.
364
365 [!] Note that read barriers should normally be paired with write barriers;
366 see the "SMP barrier pairing" subsection.
367
368
369 (4) General memory barriers.
370
371 A general memory barrier gives a guarantee that all the LOAD and STORE
372 operations specified before the barrier will appear to happen before all
373 the LOAD and STORE operations specified after the barrier with respect to
374 the other components of the system.
375
376 A general memory barrier is a partial ordering over both loads and stores.
377
378 General memory barriers imply both read and write memory barriers, and so
379 can substitute for either.
380
381
382And a couple of implicit varieties:
383
384 (5) LOCK operations.
385
386 This acts as a one-way permeable barrier. It guarantees that all memory
387 operations after the LOCK operation will appear to happen after the LOCK
388 operation with respect to the other components of the system.
389
390 Memory operations that occur before a LOCK operation may appear to happen
391 after it completes.
392
393 A LOCK operation should almost always be paired with an UNLOCK operation.
394
395
396 (6) UNLOCK operations.
397
398 This also acts as a one-way permeable barrier. It guarantees that all
399 memory operations before the UNLOCK operation will appear to happen before
400 the UNLOCK operation with respect to the other components of the system.
401
402 Memory operations that occur after an UNLOCK operation may appear to
403 happen before it completes.
404
405 The use of LOCK and UNLOCK operations generally precludes the need for
406 other sorts of memory barrier (but note the exceptions mentioned in the
407 subsection "MMIO write barrier"). In addition, an UNLOCK+LOCK pair
408 is -not- guaranteed to act as a full memory barrier. However,
409 after a LOCK on a given lock variable, all memory accesses preceding any
410 prior UNLOCK on that same variable are guaranteed to be visible.
411 In other words, within a given lock variable's critical section,
412 all accesses of all previous critical sections for that lock variable
413 are guaranteed to have completed.
414
415 This means that LOCK acts as a minimal "acquire" operation and
416 UNLOCK acts as a minimal "release" operation.
417
418
419Memory barriers are only required where there's a possibility of interaction
420between two CPUs or between a CPU and a device. If it can be guaranteed that
421there won't be any such interaction in any particular piece of code, then
422memory barriers are unnecessary in that piece of code.
423
424
425Note that these are the _minimum_ guarantees. Different architectures may give
426more substantial guarantees, but they may _not_ be relied upon outside of arch
427specific code.
428
429
430WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
431----------------------------------------------
432
433There are certain things that the Linux kernel memory barriers do not guarantee:
434
435 (*) There is no guarantee that any of the memory accesses specified before a
436 memory barrier will be _complete_ by the completion of a memory barrier
437 instruction; the barrier can be considered to draw a line in that CPU's
438 access queue that accesses of the appropriate type may not cross.
439
440 (*) There is no guarantee that issuing a memory barrier on one CPU will have
441 any direct effect on another CPU or any other hardware in the system. The
442 indirect effect will be the order in which the second CPU sees the effects
443 of the first CPU's accesses occur, but see the next point:
444
 445 (*) There is no guarantee that a CPU will see the correct order of effects
446 from a second CPU's accesses, even _if_ the second CPU uses a memory
447 barrier, unless the first CPU _also_ uses a matching memory barrier (see
448 the subsection on "SMP Barrier Pairing").
449
450 (*) There is no guarantee that some intervening piece of off-the-CPU
451 hardware[*] will not reorder the memory accesses. CPU cache coherency
452 mechanisms should propagate the indirect effects of a memory barrier
453 between CPUs, but might not do so in order.
454
455 [*] For information on bus mastering DMA and coherency please read:
456
 457 	Documentation/PCI/pci.txt
 458 	Documentation/DMA-API-HOWTO.txt
459 Documentation/DMA-API.txt
460
461
462DATA DEPENDENCY BARRIERS
463------------------------
464
465The usage requirements of data dependency barriers are a little subtle, and
466it's not always obvious that they're needed. To illustrate, consider the
467following sequence of events:
468
469 CPU 1 CPU 2
470 =============== ===============
471 { A == 1, B == 2, C = 3, P == &A, Q == &C }
472 B = 4;
473 <write barrier>
474 ACCESS_ONCE(P) = &B
475 Q = ACCESS_ONCE(P);
476 D = *Q;
477
478There's a clear data dependency here, and it would seem that by the end of the
479sequence, Q must be either &A or &B, and that:
480
481 (Q == &A) implies (D == 1)
482 (Q == &B) implies (D == 4)
483
 484 But! CPU 2's perception of P may be updated _before_ its perception of B, thus
485leading to the following situation:
486
487 (Q == &B) and (D == 2) ????
488
489Whilst this may seem like a failure of coherency or causality maintenance, it
490isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
491Alpha).
492
493To deal with this, a data dependency barrier or better must be inserted
494between the address load and the data load:
 495
496 CPU 1 CPU 2
497 =============== ===============
498 { A == 1, B == 2, C = 3, P == &A, Q == &C }
499 B = 4;
500 <write barrier>
501 ACCESS_ONCE(P) = &B
502 Q = ACCESS_ONCE(P);
503 <data dependency barrier>
504 D = *Q;
505
506This enforces the occurrence of one of the two implications, and prevents the
507third possibility from arising.
508
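In kernel source the <data dependency barrier> above would normally be either
smp_read_barrier_depends() or, better, an rcu_dereference() of P. A minimal
open-coded sketch, reusing P, Q and D from the example:

	Q = ACCESS_ONCE(P);
	smp_read_barrier_depends();	/* the <data dependency barrier> above */
	D = *Q;
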
509[!] Note that this extremely counterintuitive situation arises most easily on
510machines with split caches, so that, for example, one cache bank processes
511even-numbered cache lines and the other bank processes odd-numbered cache
512lines. The pointer P might be stored in an odd-numbered cache line, and the
513variable B might be stored in an even-numbered cache line. Then, if the
514even-numbered bank of the reading CPU's cache is extremely busy while the
515odd-numbered bank is idle, one can see the new value of the pointer P (&B),
 516 but the old value of the variable B (2).
517
518
 519 Another example of where data dependency barriers might be required is where a
520number is read from memory and then used to calculate the index for an array
521access:
522
523 CPU 1 CPU 2
524 =============== ===============
525 { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
526 M[1] = 4;
527 <write barrier>
528 ACCESS_ONCE(P) = 1
529 Q = ACCESS_ONCE(P);
530 <data dependency barrier>
531 D = M[Q];
532
533
534The data dependency barrier is very important to the RCU system,
535for example. See rcu_assign_pointer() and rcu_dereference() in
536include/linux/rcupdate.h. This permits the current target of an RCU'd
537pointer to be replaced with a new modified target, without the replacement
538target appearing to be incompletely initialised.
539
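As a sketch of that usage (the structure, the global pointer 'gp' and the field
are illustrative only), the necessary barriers are supplied by the RCU
primitives themselves:

	struct foo {
		int a;
	};
	static struct foo __rcu *gp;	/* illustrative RCU-protected pointer */

	/* publisher */
	p->a = 1;
	rcu_assign_pointer(gp, p);	/* implies the write barrier */

	/* reader, between rcu_read_lock() and rcu_read_unlock() */
	q = rcu_dereference(gp);	/* implies the data dependency barrier */
	if (q)
		do_something_with(q->a);
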
540See also the subsection on "Cache Coherency" for a more thorough example.
541
542
543CONTROL DEPENDENCIES
544--------------------
545
546A control dependency requires a full read memory barrier, not simply a data
547dependency barrier to make it work correctly. Consider the following bit of
548code:
549
 550 	q = ACCESS_ONCE(a);
551 if (q) {
552 <data dependency barrier> /* BUG: No data dependency!!! */
553 p = ACCESS_ONCE(b);
 554 	}
555
556This will not have the desired effect because there is no actual data
557dependency, but rather a control dependency that the CPU may short-circuit
558by attempting to predict the outcome in advance, so that other CPUs see
559the load from b as having happened before the load from a. In such a
560case what's actually required is:
 561
 562 	q = ACCESS_ONCE(a);
 563 	if (q) {
 564 		<read barrier>
 565 		p = ACCESS_ONCE(b);
 566 	}
567
568However, stores are not speculated. This means that ordering -is- provided
569in the following example:
570
571 q = ACCESS_ONCE(a);
572 if (ACCESS_ONCE(q)) {
573 ACCESS_ONCE(b) = p;
574 }
575
576Please note that ACCESS_ONCE() is not optional! Without the ACCESS_ONCE(),
577the compiler is within its rights to transform this example:
578
579 q = a;
580 if (q) {
581 b = p; /* BUG: Compiler can reorder!!! */
582 do_something();
583 } else {
584 b = p; /* BUG: Compiler can reorder!!! */
585 do_something_else();
586 }
587
588into this, which of course defeats the ordering:
589
590 b = p;
591 q = a;
592 if (q)
593 do_something();
594 else
595 do_something_else();
596
597Worse yet, if the compiler is able to prove (say) that the value of
598variable 'a' is always non-zero, it would be well within its rights
599to optimize the original example by eliminating the "if" statement
600as follows:
601
602 q = a;
603 b = p; /* BUG: Compiler can reorder!!! */
604 do_something();
605
606The solution is again ACCESS_ONCE(), which preserves the ordering between
607the load from variable 'a' and the store to variable 'b':
608
609 q = ACCESS_ONCE(a);
610 if (q) {
611 ACCESS_ONCE(b) = p;
612 do_something();
613 } else {
614 ACCESS_ONCE(b) = p;
615 do_something_else();
616 }
617
618You could also use barrier() to prevent the compiler from moving
619the stores to variable 'b', but barrier() would not prevent the
620compiler from proving to itself that a==1 always, so ACCESS_ONCE()
621is also needed.
622
 623It is important to note that control dependencies absolutely require
624a conditional. For example, the following "optimized" version of
625the above example breaks ordering:
626
627 q = ACCESS_ONCE(a);
628 ACCESS_ONCE(b) = p; /* BUG: No ordering vs. load from a!!! */
629 if (q) {
630 /* ACCESS_ONCE(b) = p; -- moved up, BUG!!! */
631 do_something();
632 } else {
633 /* ACCESS_ONCE(b) = p; -- moved up, BUG!!! */
634 do_something_else();
635 }
636
637It is of course legal for the prior load to be part of the conditional,
638for example, as follows:
639
640 if (ACCESS_ONCE(a) > 0) {
641 ACCESS_ONCE(b) = q / 2;
642 do_something();
643 } else {
644 ACCESS_ONCE(b) = q / 3;
645 do_something_else();
646 }
647
648This will again ensure that the load from variable 'a' is ordered before the
649stores to variable 'b'.
650
651In addition, you need to be careful what you do with the local variable 'q',
652otherwise the compiler might be able to guess the value and again remove
653the needed conditional. For example:
654
655 q = ACCESS_ONCE(a);
656 if (q % MAX) {
657 ACCESS_ONCE(b) = p;
658 do_something();
659 } else {
660 ACCESS_ONCE(b) = p;
661 do_something_else();
662 }
663
664If MAX is defined to be 1, then the compiler knows that (q % MAX) is
665equal to zero, in which case the compiler is within its rights to
666transform the above code into the following:
667
668 q = ACCESS_ONCE(a);
669 ACCESS_ONCE(b) = p;
670 do_something_else();
671
672This transformation loses the ordering between the load from variable 'a'
673and the store to variable 'b'. If you are relying on this ordering, you
674should do something like the following:
675
676 q = ACCESS_ONCE(a);
677 BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
678 if (q % MAX) {
679 ACCESS_ONCE(b) = p;
680 do_something();
681 } else {
682 ACCESS_ONCE(b) = p;
683 do_something_else();
684 }
685
686Finally, control dependencies do -not- provide transitivity. This is
687demonstrated by two related examples:
688
689 CPU 0 CPU 1
690 ===================== =====================
691 r1 = ACCESS_ONCE(x); r2 = ACCESS_ONCE(y);
692 if (r1 >= 0) if (r2 >= 0)
693 ACCESS_ONCE(y) = 1; ACCESS_ONCE(x) = 1;
694
695 assert(!(r1 == 1 && r2 == 1));
696
697The above two-CPU example will never trigger the assert(). However,
698if control dependencies guaranteed transitivity (which they do not),
699then adding the following two CPUs would guarantee a related assertion:
700
701 CPU 2 CPU 3
702 ===================== =====================
703 ACCESS_ONCE(x) = 2; ACCESS_ONCE(y) = 2;
704
705 assert(!(r1 == 2 && r2 == 2 && x == 1 && y == 1)); /* FAILS!!! */
706
707But because control dependencies do -not- provide transitivity, the
708above assertion can fail after the combined four-CPU example completes.
709If you need the four-CPU example to provide ordering, you will need
710smp_mb() between the loads and stores in the CPU 0 and CPU 1 code fragments.
711
712In summary:
713
714 (*) Control dependencies can order prior loads against later stores.
715 However, they do -not- guarantee any other sort of ordering:
716 Not prior loads against later loads, nor prior stores against
717 later anything. If you need these other forms of ordering,
 718 use smp_rmb(), smp_wmb(), or, in the case of prior stores and
719 later loads, smp_mb().
720
721 (*) Control dependencies require at least one run-time conditional
722 between the prior load and the subsequent store. If the compiler
723 is able to optimize the conditional away, it will have also
724 optimized away the ordering. Careful use of ACCESS_ONCE() can
725 help to preserve the needed conditional.
726
727 (*) Control dependencies require that the compiler avoid reordering the
728 dependency into nonexistence. Careful use of ACCESS_ONCE() or
729 barrier() can help to preserve your control dependency. Please
730 see the Compiler Barrier section for more information.
731
732 (*) Control dependencies do -not- provide transitivity. If you
733 need transitivity, use smp_mb().
734
735
736SMP BARRIER PAIRING
737-------------------
738
739When dealing with CPU-CPU interactions, certain types of memory barrier should
740always be paired. A lack of appropriate pairing is almost certainly an error.
741
742A write barrier should always be paired with a data dependency barrier or read
743barrier, though a general barrier would also be viable. Similarly a read
 744 barrier or a data dependency barrier should always be paired with at least a
745write barrier, though, again, a general barrier is viable:
746
747 CPU 1 CPU 2
748 =============== ===============
749 ACCESS_ONCE(a) = 1;
 750 	<write barrier>
751 ACCESS_ONCE(b) = 2; x = ACCESS_ONCE(b);
752 <read barrier>
753 y = ACCESS_ONCE(a);
754
755Or:
756
757 CPU 1 CPU 2
758 =============== ===============================
759 a = 1;
760 <write barrier>
761 ACCESS_ONCE(b) = &a; x = ACCESS_ONCE(b);
762 <data dependency barrier>
763 y = *x;
764
765Basically, the read barrier always has to be there, even though it can be of
766the "weaker" type.
767
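Written out as code, such a pairing might look like the following sketch, where
'data', 'ready' and do_something_with() merely stand in for whatever is being
handed from one CPU to the other:

	int data;
	int ready;

	void cpu1(void)
	{
		data = 42;
		smp_wmb();			/* pairs with the smp_rmb() in cpu2() */
		ACCESS_ONCE(ready) = 1;
	}

	void cpu2(void)
	{
		while (!ACCESS_ONCE(ready))
			;			/* wait for the flag to appear */
		smp_rmb();			/* pairs with the smp_wmb() in cpu1() */
		do_something_with(data);	/* guaranteed to see data == 42 */
	}
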
 768 [!] Note that the stores before the write barrier would normally be expected to
 769 match the loads after the read barrier or the data dependency barrier, and vice
770versa:
771
772 CPU 1 CPU 2
773 =================== ===================
774 ACCESS_ONCE(a) = 1; }---- --->{ v = ACCESS_ONCE(c);
775 ACCESS_ONCE(b) = 2; } \ / { w = ACCESS_ONCE(d);
776 <write barrier> \ <read barrier>
777 ACCESS_ONCE(c) = 3; } / \ { x = ACCESS_ONCE(a);
778 ACCESS_ONCE(d) = 4; }---- --->{ y = ACCESS_ONCE(b);
 779
780
781EXAMPLES OF MEMORY BARRIER SEQUENCES
782------------------------------------
783
 784 Firstly, write barriers act as partial orderings on store operations.
785Consider the following sequence of events:
786
787 CPU 1
788 =======================
789 STORE A = 1
790 STORE B = 2
791 STORE C = 3
792 <write barrier>
793 STORE D = 4
794 STORE E = 5
795
796This sequence of events is committed to the memory coherence system in an order
797that the rest of the system might perceive as the unordered set of { STORE A,
 798 STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
799}:
800
801 +-------+ : :
802 | | +------+
803 | |------>| C=3 | } /\
804 | | : +------+ }----- \ -----> Events perceptible to
805 | | : | A=1 | } \/ the rest of the system
806 | | : +------+ }
807 | CPU 1 | : | B=2 | }
808 | | +------+ }
809 | | wwwwwwwwwwwwwwww } <--- At this point the write barrier
810 | | +------+ } requires all stores prior to the
811 | | : | E=5 | } barrier to be committed before
 812 | | : +------+ } further stores may take place
813 | |------>| D=4 | }
814 | | +------+
815 +-------+ : :
816 |
817 | Sequence in which stores are committed to the
818 | memory system by CPU 1
819 V
820
821
 822 Secondly, data dependency barriers act as partial orderings on data-dependent
823loads. Consider the following sequence of events:
824
825 CPU 1 CPU 2
826 ======================= =======================
 827 	{ B = 7; X = 9; Y = 8; C = &Y }
828 STORE A = 1
829 STORE B = 2
830 <write barrier>
831 STORE C = &B LOAD X
832 STORE D = 4 LOAD C (gets &B)
833 LOAD *C (reads B)
834
835Without intervention, CPU 2 may perceive the events on CPU 1 in some
836effectively random order, despite the write barrier issued by CPU 1:
837
838 +-------+ : : : :
839 | | +------+ +-------+ | Sequence of update
840 | |------>| B=2 |----- --->| Y->8 | | of perception on
841 | | : +------+ \ +-------+ | CPU 2
842 | CPU 1 | : | A=1 | \ --->| C->&Y | V
843 | | +------+ | +-------+
844 | | wwwwwwwwwwwwwwww | : :
845 | | +------+ | : :
846 | | : | C=&B |--- | : : +-------+
847 | | : +------+ \ | +-------+ | |
848 | |------>| D=4 | ----------->| C->&B |------>| |
849 | | +------+ | +-------+ | |
850 +-------+ : : | : : | |
851 | : : | |
852 | : : | CPU 2 |
853 | +-------+ | |
854 Apparently incorrect ---> | | B->7 |------>| |
855 perception of B (!) | +-------+ | |
856 | : : | |
857 | +-------+ | |
858 The load of X holds ---> \ | X->9 |------>| |
859 up the maintenance \ +-------+ | |
860 of coherence of B ----->| B->2 | +-------+
861 +-------+
862 : :
863
864
865In the above example, CPU 2 perceives that B is 7, despite the load of *C
 866 (which would be B) coming after the LOAD of C.
867
868If, however, a data dependency barrier were to be placed between the load of C
869and the load of *C (ie: B) on CPU 2:
870
871 CPU 1 CPU 2
872 ======================= =======================
873 { B = 7; X = 9; Y = 8; C = &Y }
874 STORE A = 1
875 STORE B = 2
876 <write barrier>
877 STORE C = &B LOAD X
878 STORE D = 4 LOAD C (gets &B)
879 <data dependency barrier>
880 LOAD *C (reads B)
881
882then the following will occur:
883
884 +-------+ : : : :
885 | | +------+ +-------+
886 | |------>| B=2 |----- --->| Y->8 |
887 | | : +------+ \ +-------+
888 | CPU 1 | : | A=1 | \ --->| C->&Y |
889 | | +------+ | +-------+
890 | | wwwwwwwwwwwwwwww | : :
891 | | +------+ | : :
892 | | : | C=&B |--- | : : +-------+
893 | | : +------+ \ | +-------+ | |
894 | |------>| D=4 | ----------->| C->&B |------>| |
895 | | +------+ | +-------+ | |
896 +-------+ : : | : : | |
897 | : : | |
898 | : : | CPU 2 |
899 | +-------+ | |
900 | | X->9 |------>| |
901 | +-------+ | |
902 Makes sure all effects ---> \ ddddddddddddddddd | |
903 prior to the store of C \ +-------+ | |
904 are perceptible to ----->| B->2 |------>| |
905 subsequent loads +-------+ | |
906 : : +-------+
907
908
909And thirdly, a read barrier acts as a partial order on loads. Consider the
910following sequence of events:
911
912 CPU 1 CPU 2
913 ======================= =======================
 914 	{ A = 0, B = 9 }
 915 	STORE A=1
 916 	<write barrier>
 917 	STORE B=2
 918 	LOAD B
 919 	LOAD A
920
921Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
922some effectively random order, despite the write barrier issued by CPU 1:
923
924 +-------+ : : : :
925 | | +------+ +-------+
926 | |------>| A=1 |------ --->| A->0 |
927 | | +------+ \ +-------+
928 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
929 | | +------+ | +-------+
930 | |------>| B=2 |--- | : :
931 | | +------+ \ | : : +-------+
932 +-------+ : : \ | +-------+ | |
933 ---------->| B->2 |------>| |
934 | +-------+ | CPU 2 |
935 | | A->0 |------>| |
936 | +-------+ | |
937 | : : +-------+
938 \ : :
939 \ +-------+
940 ---->| A->1 |
941 +-------+
942 : :
 943
 944
 945 If, however, a read barrier were to be placed between the load of B and the
946load of A on CPU 2:
947
948 CPU 1 CPU 2
949 ======================= =======================
950 { A = 0, B = 9 }
951 STORE A=1
952 <write barrier>
953 STORE B=2
954 LOAD B
955 <read barrier>
956 LOAD A
957
958then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
9592:
960
961 +-------+ : : : :
962 | | +------+ +-------+
963 | |------>| A=1 |------ --->| A->0 |
964 | | +------+ \ +-------+
965 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
966 | | +------+ | +-------+
967 | |------>| B=2 |--- | : :
968 | | +------+ \ | : : +-------+
969 +-------+ : : \ | +-------+ | |
970 ---------->| B->2 |------>| |
971 | +-------+ | CPU 2 |
972 | : : | |
973 | : : | |
974 At this point the read ----> \ rrrrrrrrrrrrrrrrr | |
975 barrier causes all effects \ +-------+ | |
976 prior to the storage of B ---->| A->1 |------>| |
977 to be perceptible to CPU 2 +-------+ | |
978 : : +-------+
979
980
981To illustrate this more completely, consider what could happen if the code
982contained a load of A either side of the read barrier:
983
984 CPU 1 CPU 2
985 ======================= =======================
986 { A = 0, B = 9 }
987 STORE A=1
988 <write barrier>
989 STORE B=2
990 LOAD B
991 LOAD A [first load of A]
992 <read barrier>
993 LOAD A [second load of A]
994
995Even though the two loads of A both occur after the load of B, they may both
996come up with different values:
997
998 +-------+ : : : :
999 | | +------+ +-------+
1000 | |------>| A=1 |------ --->| A->0 |
1001 | | +------+ \ +-------+
1002 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
1003 | | +------+ | +-------+
1004 | |------>| B=2 |--- | : :
1005 | | +------+ \ | : : +-------+
1006 +-------+ : : \ | +-------+ | |
1007 ---------->| B->2 |------>| |
1008 | +-------+ | CPU 2 |
1009 | : : | |
1010 | : : | |
1011 | +-------+ | |
1012 | | A->0 |------>| 1st |
1013 | +-------+ | |
1014 At this point the read ----> \ rrrrrrrrrrrrrrrrr | |
1015 barrier causes all effects \ +-------+ | |
1016 prior to the storage of B ---->| A->1 |------>| 2nd |
1017 to be perceptible to CPU 2 +-------+ | |
1018 : : +-------+
1019
1020
1021But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
1022before the read barrier completes anyway:
1023
1024 +-------+ : : : :
1025 | | +------+ +-------+
1026 | |------>| A=1 |------ --->| A->0 |
1027 | | +------+ \ +-------+
1028 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
1029 | | +------+ | +-------+
1030 | |------>| B=2 |--- | : :
1031 | | +------+ \ | : : +-------+
1032 +-------+ : : \ | +-------+ | |
1033 ---------->| B->2 |------>| |
1034 | +-------+ | CPU 2 |
1035 | : : | |
1036 \ : : | |
1037 \ +-------+ | |
1038 ---->| A->1 |------>| 1st |
1039 +-------+ | |
1040 rrrrrrrrrrrrrrrrr | |
1041 +-------+ | |
1042 | A->1 |------>| 2nd |
1043 +-------+ | |
1044 : : +-------+
1045
1046
1047The guarantee is that the second load will always come up with A == 1 if the
1048load of B came up with B == 2. No such guarantee exists for the first load of
1049A; that may come up with either A == 0 or A == 1.
1050
1051
1052READ MEMORY BARRIERS VS LOAD SPECULATION
1053----------------------------------------
1054
1055Many CPUs speculate with loads: that is they see that they will need to load an
1056item from memory, and they find a time where they're not using the bus for any
1057other loads, and so do the load in advance - even though they haven't actually
1058got to that point in the instruction execution flow yet. This permits the
1059actual load instruction to potentially complete immediately because the CPU
1060already has the value to hand.
1061
1062It may turn out that the CPU didn't actually need the value - perhaps because a
1063branch circumvented the load - in which case it can discard the value or just
1064cache it for later use.
1065
1066Consider:
1067
 1068 	CPU 1			CPU 2
 1069 	======================= =======================
1070 LOAD B
1071 DIVIDE } Divide instructions generally
1072 DIVIDE } take a long time to perform
1073 LOAD A
1074
1075Which might appear as this:
1076
1077 : : +-------+
1078 +-------+ | |
1079 --->| B->2 |------>| |
1080 +-------+ | CPU 2 |
1081 : :DIVIDE | |
1082 +-------+ | |
1083 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
1084 division speculates on the +-------+ ~ | |
1085 LOAD of A : : ~ | |
1086 : :DIVIDE | |
1087 : : ~ | |
1088 Once the divisions are complete --> : : ~-->| |
1089 the CPU can then perform the : : | |
1090 LOAD with immediate effect : : +-------+
1091
1092
1093Placing a read barrier or a data dependency barrier just before the second
1094load:
1095
 1096 	CPU 1			CPU 2
 1097 	======================= =======================
1098 LOAD B
1099 DIVIDE
1100 DIVIDE
 1101 	<read barrier>
 1102 	LOAD A
1103
1104will force any value speculatively obtained to be reconsidered to an extent
1105dependent on the type of barrier used. If there was no change made to the
1106speculated memory location, then the speculated value will just be used:
1107
1108 : : +-------+
1109 +-------+ | |
1110 --->| B->2 |------>| |
1111 +-------+ | CPU 2 |
1112 : :DIVIDE | |
1113 +-------+ | |
1114 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
1115 division speculates on the +-------+ ~ | |
1116 LOAD of A : : ~ | |
1117 : :DIVIDE | |
1118 : : ~ | |
1119 : : ~ | |
1120 rrrrrrrrrrrrrrrr~ | |
1121 : : ~ | |
1122 : : ~-->| |
1123 : : | |
1124 : : +-------+
1125
1126
1127but if there was an update or an invalidation from another CPU pending, then
1128the speculation will be cancelled and the value reloaded:
1129
1130 : : +-------+
1131 +-------+ | |
1132 --->| B->2 |------>| |
1133 +-------+ | CPU 2 |
1134 : :DIVIDE | |
1135 +-------+ | |
1136 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
1137 division speculates on the +-------+ ~ | |
1138 LOAD of A : : ~ | |
1139 : :DIVIDE | |
1140 : : ~ | |
1141 : : ~ | |
1142 rrrrrrrrrrrrrrrrr | |
1143 +-------+ | |
1144 The speculation is discarded ---> --->| A->1 |------>| |
1145 and an updated value is +-------+ | |
1146 retrieved : : +-------+
1147
1148
1149TRANSITIVITY
1150------------
1151
1152Transitivity is a deeply intuitive notion about ordering that is not
1153always provided by real computer systems. The following example
1154demonstrates transitivity (also called "cumulativity"):
1155
1156 CPU 1 CPU 2 CPU 3
1157 ======================= ======================= =======================
1158 { X = 0, Y = 0 }
1159 STORE X=1 LOAD X STORE Y=1
1160 <general barrier> <general barrier>
1161 LOAD Y LOAD X
1162
1163Suppose that CPU 2's load from X returns 1 and its load from Y returns 0.
1164This indicates that CPU 2's load from X in some sense follows CPU 1's
1165store to X and that CPU 2's load from Y in some sense preceded CPU 3's
1166store to Y. The question is then "Can CPU 3's load from X return 0?"
1167
1168Because CPU 2's load from X in some sense came after CPU 1's store, it
1169is natural to expect that CPU 3's load from X must therefore return 1.
1170This expectation is an example of transitivity: if a load executing on
1171CPU A follows a load from the same variable executing on CPU B, then
1172CPU A's load must either return the same value that CPU B's load did,
1173or must return some later value.
1174
1175In the Linux kernel, use of general memory barriers guarantees
1176transitivity. Therefore, in the above example, if CPU 2's load from X
1177returns 1 and its load from Y returns 0, then CPU 3's load from X must
1178also return 1.
1179
1180However, transitivity is -not- guaranteed for read or write barriers.
1181For example, suppose that CPU 2's general barrier in the above example
1182is changed to a read barrier as shown below:
1183
1184 CPU 1 CPU 2 CPU 3
1185 ======================= ======================= =======================
1186 { X = 0, Y = 0 }
1187 STORE X=1 LOAD X STORE Y=1
1188 <read barrier> <general barrier>
1189 LOAD Y LOAD X
1190
1191This substitution destroys transitivity: in this example, it is perfectly
1192legal for CPU 2's load from X to return 1, its load from Y to return 0,
1193and CPU 3's load from X to return 0.
1194
1195The key point is that although CPU 2's read barrier orders its pair
1196of loads, it does not guarantee to order CPU 1's store. Therefore, if
1197this example runs on a system where CPUs 1 and 2 share a store buffer
1198or a level of cache, CPU 2 might have early access to CPU 1's writes.
1199General barriers are therefore required to ensure that all CPUs agree
1200on the combined order of CPU 1's and CPU 2's accesses.
1201
1202To reiterate, if your code requires transitivity, use general barriers
1203throughout.
1204
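Expressed as code, the general-barrier version of the three-CPU example might
look like this sketch, where r1, r2 and r3 stand for the values returned by the
loads:

	int x, y;
	int r1, r2, r3;

	void cpu1(void) { ACCESS_ONCE(x) = 1; }

	void cpu2(void)
	{
		r1 = ACCESS_ONCE(x);
		smp_mb();		/* general barrier: provides transitivity */
		r2 = ACCESS_ONCE(y);
	}

	void cpu3(void)
	{
		ACCESS_ONCE(y) = 1;
		smp_mb();
		r3 = ACCESS_ONCE(x);
	}

	/* If r1 == 1 and r2 == 0, then r3 is guaranteed to be 1. */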
1205
1206========================
1207EXPLICIT KERNEL BARRIERS
1208========================
1209
1210The Linux kernel has a variety of different barriers that act at different
1211levels:
1212
1213 (*) Compiler barrier.
1214
1215 (*) CPU memory barriers.
1216
1217 (*) MMIO write barrier.
1218
1219
1220COMPILER BARRIER
1221----------------
1222
1223The Linux kernel has an explicit compiler barrier function that prevents the
1224compiler from moving the memory accesses either side of it to the other side:
1225
1226 barrier();
1227
 1228 This is a general barrier -- there are no read-read or write-write variants
 1229 of barrier(). However, ACCESS_ONCE() can be thought of as a weak form
 1230 of barrier() that affects only the specific accesses flagged by the
1231ACCESS_ONCE().
 1232
1233The barrier() function has the following effects:
1234
1235 (*) Prevents the compiler from reordering accesses following the
1236 barrier() to precede any accesses preceding the barrier().
1237 One example use for this property is to ease communication between
1238 interrupt-handler code and the code that was interrupted.
1239
1240 (*) Within a loop, forces the compiler to load the variables used
1241 in that loop's conditional on each pass through that loop.
1242
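For example, a flag-polling loop can use barrier() to force the flag to be
re-read from memory on every iteration rather than being hoisted into a
register (a sketch; 'flag' is assumed to be set by another CPU or by an
interrupt handler):

	while (!flag)
		barrier();	/* reload 'flag' each time around the loop */
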
1243The ACCESS_ONCE() function can prevent any number of optimizations that,
1244while perfectly safe in single-threaded code, can be fatal in concurrent
1245code. Here are some examples of these sorts of optimizations:
1246
1247 (*) The compiler is within its rights to merge successive loads from
1248 the same variable. Such merging can cause the compiler to "optimize"
1249 the following code:
1250
1251 while (tmp = a)
1252 do_something_with(tmp);
1253
1254 into the following code, which, although in some sense legitimate
1255 for single-threaded code, is almost certainly not what the developer
1256 intended:
1257
1258 if (tmp = a)
1259 for (;;)
1260 do_something_with(tmp);
1261
1262 Use ACCESS_ONCE() to prevent the compiler from doing this to you:
1263
1264 while (tmp = ACCESS_ONCE(a))
1265 do_something_with(tmp);
1266
1267 (*) The compiler is within its rights to reload a variable, for example,
1268 in cases where high register pressure prevents the compiler from
1269 keeping all data of interest in registers. The compiler might
1270 therefore optimize the variable 'tmp' out of our previous example:
1271
1272 while (tmp = a)
1273 do_something_with(tmp);
1274
1275 This could result in the following code, which is perfectly safe in
1276 single-threaded code, but can be fatal in concurrent code:
1277
1278 while (a)
1279 do_something_with(a);
1280
1281 For example, the optimized version of this code could result in
1282 passing a zero to do_something_with() in the case where the variable
1283 a was modified by some other CPU between the "while" statement and
1284 the call to do_something_with().
1285
1286 Again, use ACCESS_ONCE() to prevent the compiler from doing this:
1287
1288 while (tmp = ACCESS_ONCE(a))
1289 do_something_with(tmp);
1290
1291 Note that if the compiler runs short of registers, it might save
1292 tmp onto the stack. The overhead of this saving and later restoring
1293 is why compilers reload variables. Doing so is perfectly safe for
1294 single-threaded code, so you need to tell the compiler about cases
1295 where it is not safe.
1296
1297 (*) The compiler is within its rights to omit a load entirely if it knows
1298 what the value will be. For example, if the compiler can prove that
1299 the value of variable 'a' is always zero, it can optimize this code:
1300
1301 while (tmp = a)
1302 do_something_with(tmp);
1303
1304 Into this:
1305
1306 do { } while (0);
1307
1308 This transformation is a win for single-threaded code because it gets
1309 rid of a load and a branch. The problem is that the compiler will
1310 carry out its proof assuming that the current CPU is the only one
1311 updating variable 'a'. If variable 'a' is shared, then the compiler's
1312 proof will be erroneous. Use ACCESS_ONCE() to tell the compiler
1313 that it doesn't know as much as it thinks it does:
1314
1315 while (tmp = ACCESS_ONCE(a))
1316 do_something_with(tmp);
1317
1318 But please note that the compiler is also closely watching what you
1319 do with the value after the ACCESS_ONCE(). For example, suppose you
1320 do the following and MAX is a preprocessor macro with the value 1:
1321
1322 while ((tmp = ACCESS_ONCE(a)) % MAX)
1323 do_something_with(tmp);
1324
1325 Then the compiler knows that the result of the "%" operator applied
1326 to MAX will always be zero, again allowing the compiler to optimize
1327 the code into near-nonexistence. (It will still load from the
1328 variable 'a'.)
1329
1330 (*) Similarly, the compiler is within its rights to omit a store entirely
1331 if it knows that the variable already has the value being stored.
1332 Again, the compiler assumes that the current CPU is the only one
1333 storing into the variable, which can cause the compiler to do the
1334 wrong thing for shared variables. For example, suppose you have
1335 the following:
1336
1337 a = 0;
1338 /* Code that does not store to variable a. */
1339 a = 0;
1340
1341 The compiler sees that the value of variable 'a' is already zero, so
1342 it might well omit the second store. This would come as a fatal
1343 surprise if some other CPU might have stored to variable 'a' in the
1344 meantime.
1345
1346 Use ACCESS_ONCE() to prevent the compiler from making this sort of
1347 wrong guess:
1348
1349 ACCESS_ONCE(a) = 0;
1350 /* Code that does not store to variable a. */
1351 ACCESS_ONCE(a) = 0;
1352
1353 (*) The compiler is within its rights to reorder memory accesses unless
1354 you tell it not to. For example, consider the following interaction
1355 between process-level code and an interrupt handler:
1356
1357 void process_level(void)
1358 {
1359 msg = get_message();
1360 flag = true;
1361 }
1362
1363 void interrupt_handler(void)
1364 {
1365 if (flag)
1366 process_message(msg);
1367 }
1368
 1369 There is nothing to prevent the compiler from transforming
1370 process_level() to the following, in fact, this might well be a
1371 win for single-threaded code:
1372
1373 void process_level(void)
1374 {
1375 flag = true;
1376 msg = get_message();
1377 }
1378
 1379 If the interrupt occurs between these two statements, then
1380 interrupt_handler() might be passed a garbled msg. Use ACCESS_ONCE()
1381 to prevent this as follows:
1382
1383 void process_level(void)
1384 {
1385 ACCESS_ONCE(msg) = get_message();
1386 ACCESS_ONCE(flag) = true;
1387 }
1388
1389 void interrupt_handler(void)
1390 {
1391 if (ACCESS_ONCE(flag))
1392 process_message(ACCESS_ONCE(msg));
1393 }
1394
1395 Note that the ACCESS_ONCE() wrappers in interrupt_handler()
1396 are needed if this interrupt handler can itself be interrupted
1397 by something that also accesses 'flag' and 'msg', for example,
1398 a nested interrupt or an NMI. Otherwise, ACCESS_ONCE() is not
1399 needed in interrupt_handler() other than for documentation purposes.
1400 (Note also that nested interrupts do not typically occur in modern
1401 Linux kernels, in fact, if an interrupt handler returns with
1402 interrupts enabled, you will get a WARN_ONCE() splat.)
1403
1404 You should assume that the compiler can move ACCESS_ONCE() past
1405 code not containing ACCESS_ONCE(), barrier(), or similar primitives.
1406
1407 This effect could also be achieved using barrier(), but ACCESS_ONCE()
1408 is more selective: With ACCESS_ONCE(), the compiler need only forget
1409 the contents of the indicated memory locations, while with barrier()
1410 the compiler must discard the value of all memory locations that
 1411 it has currently cached in any machine registers. Of course,
1412 the compiler must also respect the order in which the ACCESS_ONCE()s
1413 occur, though the CPU of course need not do so.
1414
1415 (*) The compiler is within its rights to invent stores to a variable,
1416 as in the following example:
1417
1418 if (a)
1419 b = a;
1420 else
1421 b = 42;
1422
1423 The compiler might save a branch by optimizing this as follows:
1424
1425 b = 42;
1426 if (a)
1427 b = a;
1428
1429 In single-threaded code, this is not only safe, but also saves
1430 a branch. Unfortunately, in concurrent code, this optimization
1431 could cause some other CPU to see a spurious value of 42 -- even
1432 if variable 'a' was never zero -- when loading variable 'b'.
1433 Use ACCESS_ONCE() to prevent this as follows:
1434
1435 if (a)
1436 ACCESS_ONCE(b) = a;
1437 else
1438 ACCESS_ONCE(b) = 42;
1439
1440 The compiler can also invent loads. These are usually less
1441 damaging, but they can result in cache-line bouncing and thus in
1442 poor performance and scalability. Use ACCESS_ONCE() to prevent
1443 invented loads.
1444
1445 (*) For aligned memory locations whose size allows them to be accessed
 1446 with a single memory-reference instruction, ACCESS_ONCE() prevents "load tearing"
1447 and "store tearing," in which a single large access is replaced by
1448 multiple smaller accesses. For example, given an architecture having
1449 16-bit store instructions with 7-bit immediate fields, the compiler
1450 might be tempted to use two 16-bit store-immediate instructions to
1451 implement the following 32-bit store:
1452
1453 p = 0x00010002;
1454
1455 Please note that GCC really does use this sort of optimization,
1456 which is not surprising given that it would likely take more
1457 than two instructions to build the constant and then store it.
1458 This optimization can therefore be a win in single-threaded code.
1459 In fact, a recent bug (since fixed) caused GCC to incorrectly use
1460 this optimization in a volatile store. In the absence of such bugs,
1461 use of ACCESS_ONCE() prevents store tearing in the following example:
1462
1463 ACCESS_ONCE(p) = 0x00010002;
1464
1465 Use of packed structures can also result in load and store tearing,
1466 as in this example:
1467
1468 struct __attribute__((__packed__)) foo {
1469 short a;
1470 int b;
1471 short c;
1472 };
1473 struct foo foo1, foo2;
1474 ...
1475
1476 foo2.a = foo1.a;
1477 foo2.b = foo1.b;
1478 foo2.c = foo1.c;
1479
1480 Because there are no ACCESS_ONCE() wrappers and no volatile markings,
1481 the compiler would be well within its rights to implement these three
1482 assignment statements as a pair of 32-bit loads followed by a pair
1483 of 32-bit stores. This would result in load tearing on 'foo1.b'
1484 and store tearing on 'foo2.b'. ACCESS_ONCE() again prevents tearing
1485 in this example:
1486
1487 foo2.a = foo1.a;
1488 ACCESS_ONCE(foo2.b) = ACCESS_ONCE(foo1.b);
1489 foo2.c = foo1.c;
1490
1491All that aside, it is never necessary to use ACCESS_ONCE() on a variable
1492that has been marked volatile. For example, because 'jiffies' is marked
1493volatile, it is never necessary to say ACCESS_ONCE(jiffies). The reason
1494for this is that ACCESS_ONCE() is implemented as a volatile cast, which
1495has no effect when its argument is already marked volatile.
1496
1497Please note that these compiler barriers have no direct effect on the CPU,
1498which may then reorder things however it wishes.
1499
1500
1501CPU MEMORY BARRIERS
1502-------------------
1503
1504The Linux kernel has eight basic CPU memory barriers:
1505
1506 TYPE MANDATORY SMP CONDITIONAL
1507 =============== ======================= ===========================
1508 GENERAL mb() smp_mb()
1509 WRITE wmb() smp_wmb()
1510 READ rmb() smp_rmb()
1511 DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends()
1512
1513
1514All memory barriers except the data dependency barriers imply a compiler
1515barrier. Data dependencies do not impose any additional compiler ordering.
1516
1517Aside: In the case of data dependencies, the compiler would be expected to
1518issue the loads in the correct order (eg. `a[b]` would have to load the value
1519of b before loading a[b]), however there is no guarantee in the C specification
1520that the compiler may not speculate the value of b (eg. is equal to 1) and load
1521a before b (eg. tmp = a[1]; if (b != 1) tmp = a[b]; ). There is also the
1522problem of a compiler reloading b after having loaded a[b], thus having a newer
1523copy of b than a[b]. A consensus has not yet been reached about these problems,
1524however the ACCESS_ONCE macro is a good place to start looking.
1525
1526SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
 1527 systems because it is assumed that a CPU will appear to be self-consistent,
1528and will order overlapping accesses correctly with respect to itself.
1529
1530[!] Note that SMP memory barriers _must_ be used to control the ordering of
1531references to shared memory on SMP systems, though the use of locking instead
1532is sufficient.
1533
1534Mandatory barriers should not be used to control SMP effects, since mandatory
1535barriers unnecessarily impose overhead on UP systems. They may, however, be
1536used to control MMIO effects on accesses through relaxed memory I/O windows.
1537These are required even on non-SMP systems as they affect the order in which
1538memory operations appear to a device by prohibiting both the compiler and the
1539CPU from reordering them.
1540
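For example, a driver using relaxed MMIO accessors might use a mandatory write
barrier to make a DMA descriptor visible in memory before the doorbell write
that tells the device to look at it. The sketch below assumes the architecture
provides writel_relaxed(); the device, its DOORBELL register and the descriptor
layout are hypothetical:

	desc->addr = cpu_to_le64(buf_dma);	/* fill in the descriptor... */
	desc->len  = cpu_to_le32(len);
	wmb();					/* ...before the device is told about it */
	writel_relaxed(tail, dev->regs + DOORBELL);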
1541
1542There are some more advanced barrier functions:
1543
1544 (*) set_mb(var, value)
 1545
 1546 This assigns the value to the variable and then inserts a full memory
 1547 barrier after it, depending on the function. It isn't guaranteed to
1548 insert anything more than a compiler barrier in a UP compilation.
1549
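For example, the classic use is on the sleep side of a sleep/wake-up pair,
where the task state must be published before the wait condition is re-checked.
This is only a sketch of the idea ('event_pending' is illustrative), not the
exact set_current_state() implementation:

	set_mb(current->state, TASK_UNINTERRUPTIBLE);	/* roughly: store, then smp_mb() */
	if (!event_pending)				/* re-check the condition... */
		schedule();				/* ...before actually sleeping */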
1550
1551 (*) smp_mb__before_atomic_dec();
1552 (*) smp_mb__after_atomic_dec();
1553 (*) smp_mb__before_atomic_inc();
1554 (*) smp_mb__after_atomic_inc();
1555
1556 These are for use with atomic add, subtract, increment and decrement
1557 functions that don't return a value, especially when used for reference
1558 counting. These functions do not imply memory barriers.
1559
1560 As an example, consider a piece of code that marks an object as being dead
1561 and then decrements the object's reference count:
1562
1563 obj->dead = 1;
1564 smp_mb__before_atomic_dec();
1565 atomic_dec(&obj->ref_count);
1566
1567 This makes sure that the death mark on the object is perceived to be set
1568 *before* the reference counter is decremented.
1569
1570 See Documentation/atomic_ops.txt for more information. See the "Atomic
1571 operations" subsection for information on where to use these.
1572
1573
1574 (*) smp_mb__before_clear_bit(void);
1575 (*) smp_mb__after_clear_bit(void);
1576
1577 These are for use similar to the atomic inc/dec barriers. These are
1578 typically used for bitwise unlocking operations, so care must be taken as
1579 there are no implicit memory barriers here either.
1580
1581 Consider implementing an unlock operation of some nature by clearing a
1582 locking bit. The clear_bit() would then need to be barriered like this:
1583
1584 smp_mb__before_clear_bit();
1585 clear_bit( ... );
1586
1587 This prevents memory operations before the clear leaking to after it. See
1588 the subsection on "Locking Functions" with reference to UNLOCK operation
1589 implications.
1590
1591 See Documentation/atomic_ops.txt for more information. See the "Atomic
1592 operations" subsection for information on where to use these.
1593
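As a fuller illustration of the latter pair, a single bit can serve as a
simple lock: the acquiring side can use test_and_set_bit(), which returns a
value and therefore already implies full memory barriers, whilst the releasing
side must supply the barrier explicitly before clear_bit(). A minimal sketch
(the bit number and flags word are illustrative):

	/* LOCK: test_and_set_bit() returns the old bit value and implies
	 * full barriers on both sides of the operation */
	while (test_and_set_bit(0, &thing->flags))
		cpu_relax();

	/* ... critical section ... */

	/* UNLOCK: clear_bit() implies no barrier, so one must be supplied */
	smp_mb__before_clear_bit();
	clear_bit(0, &thing->flags);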
1594
1595MMIO WRITE BARRIER
1596------------------
1597
1598The Linux kernel also has a special barrier for use with memory-mapped I/O
1599writes:
1600
1601 mmiowb();
1602
1603This is a variation on the mandatory write barrier that causes writes to weakly
1604ordered I/O regions to be partially ordered. Its effects may go beyond the
1605CPU->Hardware interface and actually affect the hardware at some level.
1606
1607See the subsection "Locks vs I/O accesses" for more information.
1608
1609
1610===============================
1611IMPLICIT KERNEL MEMORY BARRIERS
1612===============================
1613
1614Some of the other functions in the linux kernel imply memory barriers, amongst
which are locking and scheduling functions.
1616
1617This specification is a _minimum_ guarantee; any particular architecture may
1618provide more substantial guarantees, but these may not be relied upon outside
1619of arch specific code.
1620
1621
1622LOCKING FUNCTIONS
1623-----------------
1624
1625The Linux kernel has a number of locking constructs:
1626
1627 (*) spin locks
1628 (*) R/W spin locks
1629 (*) mutexes
1630 (*) semaphores
1631 (*) R/W semaphores
1632 (*) RCU
1633
1634In all cases there are variants on "LOCK" operations and "UNLOCK" operations
1635for each construct. These operations all imply certain barriers:
1636
1637 (1) LOCK operation implication:
1638
1639 Memory operations issued after the LOCK will be completed after the LOCK
1640 operation has completed.
1641
1642 Memory operations issued before the LOCK may be completed after the
1643 LOCK operation has completed. An smp_mb__before_spinlock(), combined
1644 with a following LOCK, orders prior loads against subsequent stores
     and prior stores against subsequent stores. Note that
1646 this is weaker than smp_mb()! The smp_mb__before_spinlock()
1647 primitive is free on many architectures.
1648
1649 (2) UNLOCK operation implication:
1650
1651 Memory operations issued before the UNLOCK will be completed before the
1652 UNLOCK operation has completed.
1653
1654 Memory operations issued after the UNLOCK may be completed before the
1655 UNLOCK operation has completed.
1656
1657 (3) LOCK vs LOCK implication:
1658
1659 All LOCK operations issued before another LOCK operation will be completed
1660 before that LOCK operation.
1661
1662 (4) LOCK vs UNLOCK implication:
1663
1664 All LOCK operations issued before an UNLOCK operation will be completed
1665 before the UNLOCK operation.
1666
108b42b4
DH
1667 (5) Failed conditional LOCK implication:
1668
1669 Certain variants of the LOCK operation may fail, either due to being
1670 unable to get the lock immediately, or due to receiving an unblocked
1671 signal whilst asleep waiting for the lock to become available. Failed
1672 locks do not imply any sort of barrier.
1673
1674[!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
1675 barriers is that the effects of instructions outside of a critical section
1676 may seep into the inside of the critical section.

A LOCK followed by an UNLOCK may not be assumed to be a full memory barrier
1679because it is possible for an access preceding the LOCK to happen after the
1680LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
1681two accesses can themselves then cross:
1682
1683 *A = a;
1684 LOCK M
1685 UNLOCK M
1686 *B = b;
1687
1688may occur as:
1689
1690 LOCK M, STORE *B, STORE *A, UNLOCK M
1691
1692This same reordering can of course occur if the LOCK and UNLOCK are
1693to the same lock variable, but only from the perspective of another
1694CPU not holding that lock.
1695
1696In short, an UNLOCK followed by a LOCK may -not- be assumed to be a full
1697memory barrier because it is possible for a preceding UNLOCK to pass a
1698later LOCK from the viewpoint of the CPU, but not from the viewpoint
1699of the compiler. Note that deadlocks cannot be introduced by this
1700interchange because if such a deadlock threatened, the UNLOCK would
1701simply complete.
1702
1703If it is necessary for an UNLOCK-LOCK pair to produce a full barrier,
1704the LOCK can be followed by an smp_mb__after_unlock_lock() invocation.
1705This will produce a full barrier if either (a) the UNLOCK and the LOCK
1706are executed by the same CPU or task, or (b) the UNLOCK and LOCK act
1707on the same lock variable. The smp_mb__after_unlock_lock() primitive
1708is free on many architectures. Without smp_mb__after_unlock_lock(),
1709the critical sections corresponding to the UNLOCK and the LOCK can cross:
1710
1711 *A = a;
1712 UNLOCK M
1713 LOCK N
1714 *B = b;
1715
1716could occur as:
1717
1718 LOCK N, STORE *B, STORE *A, UNLOCK M
1719
1720With smp_mb__after_unlock_lock(), they cannot, so that:
1721
1722 *A = a;
1723 UNLOCK M
1724 LOCK N
1725 smp_mb__after_unlock_lock();
1726 *B = b;
1727
1728will always occur as either of the following:
1729
1730 STORE *A, UNLOCK, LOCK, STORE *B
1731 STORE *A, LOCK, UNLOCK, STORE *B
1732
1733If the UNLOCK and LOCK were instead both operating on the same lock
1734variable, only the first of these two alternatives can occur.

Locks and semaphores may not provide any guarantee of ordering on UP compiled
1737systems, and so cannot be counted on in such a situation to actually achieve
1738anything at all - especially with respect to I/O accesses - unless combined
1739with interrupt disabling operations.
1740
1741See also the section on "Inter-CPU locking barrier effects".
1742
1743
1744As an example, consider the following:
1745
1746 *A = a;
1747 *B = b;
1748 LOCK
1749 *C = c;
1750 *D = d;
1751 UNLOCK
1752 *E = e;
1753 *F = f;
1754
1755The following sequence of events is acceptable:
1756
1757 LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
1758
1759 [+] Note that {*F,*A} indicates a combined access.
1760
1761But none of the following are:
1762
1763 {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E
1764 *A, *B, *C, LOCK, *D, UNLOCK, *E, *F
1765 *A, *B, LOCK, *C, UNLOCK, *D, *E, *F
1766 *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E
1767
1768
1769
1770INTERRUPT DISABLING FUNCTIONS
1771-----------------------------
1772
1773Functions that disable interrupts (LOCK equivalent) and enable interrupts
1774(UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O
barriers are required in such a situation, they must be provided by some
1776other means.
1777
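For example, disabling interrupts around an update to shared data does not
stop another CPU from observing the stores out of order; an SMP barrier is
still required (a sketch, with illustrative variable names):

	local_irq_save(flags);
	shared_data = compute_value();	/* illustrative helper */
	smp_wmb();			/* still needed for other CPUs */
	data_ready = 1;
	local_irq_restore(flags);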
1778
1779SLEEP AND WAKE-UP FUNCTIONS
1780---------------------------
1781
1782Sleeping and waking on an event flagged in global data can be viewed as an
1783interaction between two pieces of data: the task state of the task waiting for
1784the event and the global data used to indicate the event. To make sure that
1785these appear to happen in the right order, the primitives to begin the process
1786of going to sleep, and the primitives to initiate a wake up imply certain
1787barriers.
1788
1789Firstly, the sleeper normally follows something like this sequence of events:
1790
1791 for (;;) {
1792 set_current_state(TASK_UNINTERRUPTIBLE);
1793 if (event_indicated)
1794 break;
1795 schedule();
1796 }
1797
1798A general memory barrier is interpolated automatically by set_current_state()
1799after it has altered the task state:
1800
1801 CPU 1
1802 ===============================
1803 set_current_state();
1804 set_mb();
1805 STORE current->state
1806 <general barrier>
1807 LOAD event_indicated
1808
1809set_current_state() may be wrapped by:
1810
1811 prepare_to_wait();
1812 prepare_to_wait_exclusive();
1813
1814which therefore also imply a general memory barrier after setting the state.
1815The whole sequence above is available in various canned forms, all of which
1816interpolate the memory barrier in the right place:
1817
1818 wait_event();
1819 wait_event_interruptible();
1820 wait_event_interruptible_exclusive();
1821 wait_event_interruptible_timeout();
1822 wait_event_killable();
1823 wait_event_timeout();
1824 wait_on_bit();
1825 wait_on_bit_lock();
1826
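For example, the open-coded sleeper loop above collapses to a single call
(sketch, using the event_wait_queue/event_indicated names from the
surrounding examples):

	static DECLARE_WAIT_QUEUE_HEAD(event_wait_queue);

	/* sleeps until event_indicated becomes true, with the implied
	 * barrier after each task-state change */
	wait_event(event_wait_queue, event_indicated);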
1827
1828Secondly, code that performs a wake up normally follows something like this:
1829
1830 event_indicated = 1;
1831 wake_up(&event_wait_queue);
1832
1833or:
1834
1835 event_indicated = 1;
1836 wake_up_process(event_daemon);
1837
1838A write memory barrier is implied by wake_up() and co. if and only if they wake
1839something up. The barrier occurs before the task state is cleared, and so sits
1840between the STORE to indicate the event and the STORE to set TASK_RUNNING:
1841
1842 CPU 1 CPU 2
1843 =============================== ===============================
1844 set_current_state(); STORE event_indicated
1845 set_mb(); wake_up();
1846 STORE current->state <write barrier>
1847 <general barrier> STORE current->state
1848 LOAD event_indicated
1849
1850The available waker functions include:
1851
1852 complete();
1853 wake_up();
1854 wake_up_all();
1855 wake_up_bit();
1856 wake_up_interruptible();
1857 wake_up_interruptible_all();
1858 wake_up_interruptible_nr();
1859 wake_up_interruptible_poll();
1860 wake_up_interruptible_sync();
1861 wake_up_interruptible_sync_poll();
1862 wake_up_locked();
1863 wake_up_locked_poll();
1864 wake_up_nr();
1865 wake_up_poll();
1866 wake_up_process();
1867
1868
1869[!] Note that the memory barriers implied by the sleeper and the waker do _not_
1870order multiple stores before the wake-up with respect to loads of those stored
1871values after the sleeper has called set_current_state(). For instance, if the
1872sleeper does:
1873
1874 set_current_state(TASK_INTERRUPTIBLE);
1875 if (event_indicated)
1876 break;
1877 __set_current_state(TASK_RUNNING);
1878 do_something(my_data);
1879
1880and the waker does:
1881
1882 my_data = value;
1883 event_indicated = 1;
1884 wake_up(&event_wait_queue);
1885
1886there's no guarantee that the change to event_indicated will be perceived by
1887the sleeper as coming after the change to my_data. In such a circumstance, the
1888code on both sides must interpolate its own memory barriers between the
1889separate data accesses. Thus the above sleeper ought to do:
1890
1891 set_current_state(TASK_INTERRUPTIBLE);
1892 if (event_indicated) {
1893 smp_rmb();
1894 do_something(my_data);
1895 }
1896
1897and the waker should do:
1898
1899 my_data = value;
1900 smp_wmb();
1901 event_indicated = 1;
1902 wake_up(&event_wait_queue);
1903
1904
1905MISCELLANEOUS FUNCTIONS
1906-----------------------
1907
1908Other functions that imply barriers:
1909
1910 (*) schedule() and similar imply full memory barriers.
1911
1912
1913=================================
1914INTER-CPU LOCKING BARRIER EFFECTS
1915=================================
1916
1917On SMP systems locking primitives give a more substantial form of barrier: one
1918that does affect memory access ordering on other CPUs, within the context of
1919conflict on any particular lock.
1920
1921
1922LOCKS VS MEMORY ACCESSES
1923------------------------
1924
Consider the following: the system has a pair of spinlocks (M) and (Q), and
1926three CPUs; then should the following sequence of events occur:
1927
1928 CPU 1 CPU 2
1929 =============================== ===============================
	ACCESS_ONCE(*A) = a;		ACCESS_ONCE(*E) = e;
	LOCK M				LOCK Q
	ACCESS_ONCE(*B) = b;		ACCESS_ONCE(*F) = f;
	ACCESS_ONCE(*C) = c;		ACCESS_ONCE(*G) = g;
	UNLOCK M			UNLOCK Q
	ACCESS_ONCE(*D) = d;		ACCESS_ONCE(*H) = h;

Then there is no guarantee as to what order CPU 3 will see the accesses to *A
1938through *H occur in, other than the constraints imposed by the separate locks
1939on the separate CPUs. It might, for example, see:
1940
1941 *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
1942
1943But it won't see any of:
1944
1945 *B, *C or *D preceding LOCK M
1946 *A, *B or *C following UNLOCK M
1947 *F, *G or *H preceding LOCK Q
1948 *E, *F or *G following UNLOCK Q
1949
1950
1951However, if the following occurs:
1952
1953 CPU 1 CPU 2
1954 =============================== ===============================
1955 ACCESS_ONCE(*A) = a;
1956 LOCK M [1]
1957 ACCESS_ONCE(*B) = b;
1958 ACCESS_ONCE(*C) = c;
1959 UNLOCK M [1]
1960 ACCESS_ONCE(*D) = d; ACCESS_ONCE(*E) = e;
1961 LOCK M [2]
	smp_mb__after_unlock_lock();
1963 ACCESS_ONCE(*F) = f;
1964 ACCESS_ONCE(*G) = g;
1965 UNLOCK M [2]
1966 ACCESS_ONCE(*H) = h;

CPU 3 might see:
1969
1970 *E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
1971 LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
1972
But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
1974
1975 *B, *C, *D, *F, *G or *H preceding LOCK M [1]
1976 *A, *B or *C following UNLOCK M [1]
1977 *F, *G or *H preceding LOCK M [2]
1978 *A, *B, *C, *E, *F or *G following UNLOCK M [2]
1979
1980Note that the smp_mb__after_unlock_lock() is critically important
1981here: Without it CPU 3 might see some of the above orderings.
1982Without smp_mb__after_unlock_lock(), the accesses are not guaranteed
1983to be seen in order unless CPU 3 holds lock M.
1984
1985
1986LOCKS VS I/O ACCESSES
1987---------------------
1988
1989Under certain circumstances (especially involving NUMA), I/O accesses within
1990two spinlocked sections on two different CPUs may be seen as interleaved by the
1991PCI bridge, because the PCI bridge does not necessarily participate in the
1992cache-coherence protocol, and is therefore incapable of issuing the required
1993read memory barriers.
1994
1995For example:
1996
1997 CPU 1 CPU 2
1998 =============================== ===============================
	spin_lock(Q);
	writel(0, ADDR);
2001 writel(1, DATA);
2002 spin_unlock(Q);
2003 spin_lock(Q);
2004 writel(4, ADDR);
2005 writel(5, DATA);
2006 spin_unlock(Q);
2007
2008may be seen by the PCI bridge as follows:
2009
2010 STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
2011
2012which would probably cause the hardware to malfunction.
2013
2014
2015What is necessary here is to intervene with an mmiowb() before dropping the
2016spinlock, for example:
2017
2018 CPU 1 CPU 2
2019 =============================== ===============================
	spin_lock(Q);
	writel(0, ADDR);
2022 writel(1, DATA);
2023 mmiowb();
2024 spin_unlock(Q);
2025 spin_lock(Q);
2026 writel(4, ADDR);
2027 writel(5, DATA);
2028 mmiowb();
2029 spin_unlock(Q);
2030
2031this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
2032before either of the stores issued on CPU 2.
2033
2034
2035Furthermore, following a store by a load from the same device obviates the need
2036for the mmiowb(), because the load forces the store to complete before the load
2037is performed:
2038
2039 CPU 1 CPU 2
2040 =============================== ===============================
	spin_lock(Q);
	writel(0, ADDR);
2043 a = readl(DATA);
2044 spin_unlock(Q);
2045 spin_lock(Q);
2046 writel(4, ADDR);
2047 b = readl(DATA);
2048 spin_unlock(Q);
2049
2050
2051See Documentation/DocBook/deviceiobook.tmpl for more information.
2052
2053
2054=================================
2055WHERE ARE MEMORY BARRIERS NEEDED?
2056=================================
2057
2058Under normal operation, memory operation reordering is generally not going to
2059be a problem as a single-threaded linear piece of code will still appear to
50fa610a 2060work correctly, even if it's in an SMP kernel. There are, however, four
2061circumstances in which reordering definitely _could_ be a problem:
2062
2063 (*) Interprocessor interaction.
2064
2065 (*) Atomic operations.
2066
81fc6323 2067 (*) Accessing devices.
2068
2069 (*) Interrupts.
2070
2071
2072INTERPROCESSOR INTERACTION
2073--------------------------
2074
2075When there's a system with more than one processor, more than one CPU in the
2076system may be working on the same data set at the same time. This can cause
2077synchronisation problems, and the usual way of dealing with them is to use
2078locks. Locks, however, are quite expensive, and so it may be preferable to
2079operate without the use of a lock if at all possible. In such a case
2080operations that affect both CPUs may have to be carefully ordered to prevent
2081a malfunction.
2082
2083Consider, for example, the R/W semaphore slow path. Here a waiting process is
2084queued on the semaphore, by virtue of it having a piece of its stack linked to
2085the semaphore's list of waiting processes:
2086
2087 struct rw_semaphore {
2088 ...
2089 spinlock_t lock;
2090 struct list_head waiters;
2091 };
2092
2093 struct rwsem_waiter {
2094 struct list_head list;
2095 struct task_struct *task;
2096 };
2097
2098To wake up a particular waiter, the up_read() or up_write() functions have to:
2099
 (1) read the next pointer from this waiter's record to know where the
2101 next waiter record is;
2102
 (2) read the pointer to the waiter's task structure;
2104
2105 (3) clear the task pointer to tell the waiter it has been given the semaphore;
2106
2107 (4) call wake_up_process() on the task; and
2108
2109 (5) release the reference held on the waiter's task struct.
2110
In other words, it has to perform this sequence of events:
2112
2113 LOAD waiter->list.next;
2114 LOAD waiter->task;
2115 STORE waiter->task;
2116 CALL wakeup
2117 RELEASE task
2118
2119and if any of these steps occur out of order, then the whole thing may
2120malfunction.
2121
2122Once it has queued itself and dropped the semaphore lock, the waiter does not
2123get the lock again; it instead just waits for its task pointer to be cleared
2124before proceeding. Since the record is on the waiter's stack, this means that
2125if the task pointer is cleared _before_ the next pointer in the list is read,
2126another CPU might start processing the waiter and might clobber the waiter's
2127stack before the up*() function has a chance to read the next pointer.
2128
2129Consider then what might happen to the above sequence of events:
2130
2131 CPU 1 CPU 2
2132 =============================== ===============================
2133 down_xxx()
2134 Queue waiter
2135 Sleep
2136 up_yyy()
2137 LOAD waiter->task;
2138 STORE waiter->task;
2139 Woken up by other event
2140 <preempt>
2141 Resume processing
2142 down_xxx() returns
2143 call foo()
2144 foo() clobbers *waiter
2145 </preempt>
2146 LOAD waiter->list.next;
2147 --- OOPS ---
2148
2149This could be dealt with using the semaphore lock, but then the down_xxx()
2150function has to needlessly get the spinlock again after being woken up.
2151
2152The way to deal with this is to insert a general SMP memory barrier:
2153
2154 LOAD waiter->list.next;
2155 LOAD waiter->task;
2156 smp_mb();
2157 STORE waiter->task;
2158 CALL wakeup
2159 RELEASE task
2160
2161In this case, the barrier makes a guarantee that all memory accesses before the
2162barrier will appear to happen before all the memory accesses after the barrier
2163with respect to the other CPUs on the system. It does _not_ guarantee that all
2164the memory accesses before the barrier will be complete by the time the barrier
2165instruction itself is complete.
2166
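Expressed as (illustrative, simplified) C, the up*() side of the interaction
described above might look like this; it is a sketch of the idea, not the
kernel's actual rwsem implementation:

	static struct list_head *wake_one_waiter(struct rwsem_waiter *waiter)
	{
		struct list_head *next = waiter->list.next;	/* step (1) */
		struct task_struct *tsk = waiter->task;		/* step (2) */

		smp_mb();		/* order the loads above before the
					 * store below, as seen by the waiter */
		waiter->task = NULL;	/* step (3): hand over the semaphore */
		wake_up_process(tsk);	/* step (4) */
		put_task_struct(tsk);	/* step (5): drop the task reference */

		return next;		/* caller continues down the list */
	}
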
2167On a UP system - where this wouldn't be a problem - the smp_mb() is just a
2168compiler barrier, thus making sure the compiler emits the instructions in the
2169right order without actually intervening in the CPU. Since there's only one
2170CPU, that CPU's dependency ordering logic will take care of everything else.
2171
2172
2173ATOMIC OPERATIONS
2174-----------------
2175
2176Whilst they are technically interprocessor interaction considerations, atomic
2177operations are noted specially as some of them imply full memory barriers and
2178some don't, but they're very heavily relied on as a group throughout the
2179kernel.
2180
2181Any atomic operation that modifies some state in memory and returns information
2182about the state (old or new) implies an SMP-conditional general memory barrier
2183(smp_mb()) on each side of the actual operation (with the exception of
2184explicit lock operations, described later). These include:
2185
2186 xchg();
2187 cmpxchg();
2188 atomic_xchg(); atomic_long_xchg();
2189 atomic_cmpxchg(); atomic_long_cmpxchg();
2190 atomic_inc_return(); atomic_long_inc_return();
2191 atomic_dec_return(); atomic_long_dec_return();
2192 atomic_add_return(); atomic_long_add_return();
2193 atomic_sub_return(); atomic_long_sub_return();
2194 atomic_inc_and_test(); atomic_long_inc_and_test();
2195 atomic_dec_and_test(); atomic_long_dec_and_test();
2196 atomic_sub_and_test(); atomic_long_sub_and_test();
2197 atomic_add_negative(); atomic_long_add_negative();
2198 test_and_set_bit();
2199 test_and_clear_bit();
2200 test_and_change_bit();
2201
	/* when it succeeds (returns 1) */
2203 atomic_add_unless(); atomic_long_add_unless();
2204
2205These are used for such things as implementing LOCK-class and UNLOCK-class
2206operations and adjusting reference counters towards object destruction, and as
2207such the implicit memory barrier effects are necessary.
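
For example, a reference-count drop that must not be reordered before a final
update to the object can rely on the implied barriers alone (a sketch; the
object layout and destructor are illustrative):

	obj->dead = 1;
	if (atomic_dec_and_test(&obj->ref_count))
		destroy_object(obj);	/* hypothetical destructor */

Because atomic_dec_and_test() returns a value, it implies smp_mb() on both
sides, so no smp_mb__before_atomic_dec() is needed here.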


The following operations are potential problems as they do _not_ imply memory
2211barriers, but might be used for implementing such things as UNLOCK-class
2212operations:
108b42b4 2213
	atomic_set();
2215 set_bit();
2216 clear_bit();
2217 change_bit();
2218
2219With these the appropriate explicit memory barrier should be used if necessary
2220(smp_mb__before_clear_bit() for instance).
2221
2222
2223The following also do _not_ imply memory barriers, and so may require explicit
2224memory barriers under some circumstances (smp_mb__before_atomic_dec() for
instance):
2226
2227 atomic_add();
2228 atomic_sub();
2229 atomic_inc();
2230 atomic_dec();
2231
2232If they're used for statistics generation, then they probably don't need memory
2233barriers, unless there's a coupling between statistical data.
2234
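For example (the counter is illustrative):

	/* statistics only: no ordering requirement, hence no barrier */
	atomic_inc(&stats.rx_packets);
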
2235If they're used for reference counting on an object to control its lifetime,
2236they probably don't need memory barriers because either the reference count
2237will be adjusted inside a locked section, or the caller will already hold
sufficient references to make the lock, and thus a memory barrier, unnecessary.
2239
2240If they're used for constructing a lock of some description, then they probably
2241do need memory barriers as a lock primitive generally has to do things in a
2242specific order.
2243
Basically, each usage case has to be carefully considered as to whether memory
2245barriers are needed or not.
2246
2247The following operations are special locking primitives:
2248
2249 test_and_set_bit_lock();
2250 clear_bit_unlock();
2251 __clear_bit_unlock();
2252
2253These implement LOCK-class and UNLOCK-class operations. These should be used in
2254preference to other operations when implementing locking primitives, because
2255their implementations can be optimised on many architectures.
2256
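A minimal sketch of a bit-based lock built on these primitives (the bit
number and word are illustrative):

	while (test_and_set_bit_lock(0, &word))	/* LOCK-class */
		cpu_relax();

	/* ... critical section ... */

	clear_bit_unlock(0, &word);		/* UNLOCK-class */
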
2257[!] Note that special memory barrier primitives are available for these
2258situations because on some CPUs the atomic instructions used imply full memory
2259barriers, and so barrier instructions are superfluous in conjunction with them,
2260and in such cases the special barrier primitives will be no-ops.
2261
2262See Documentation/atomic_ops.txt for more information.
2263
2264
2265ACCESSING DEVICES
2266-----------------
2267
2268Many devices can be memory mapped, and so appear to the CPU as if they're just
2269a set of memory locations. To control such a device, the driver usually has to
2270make the right memory accesses in exactly the right order.
2271
2272However, having a clever CPU or a clever compiler creates a potential problem
2273in that the carefully sequenced accesses in the driver code won't reach the
2274device in the requisite order if the CPU or the compiler thinks it is more
2275efficient to reorder, combine or merge accesses - something that would cause
2276the device to malfunction.
2277
2278Inside of the Linux kernel, I/O should be done through the appropriate accessor
2279routines - such as inb() or writel() - which know how to make such accesses
2280appropriately sequential. Whilst this, for the most part, renders the explicit
2281use of memory barriers unnecessary, there are a couple of situations where they
2282might be needed:
2283
2284 (1) On some systems, I/O stores are not strongly ordered across all CPUs, and
2285 so for _all_ general drivers locks should be used and mmiowb() must be
2286 issued prior to unlocking the critical section.
2287
2288 (2) If the accessor functions are used to refer to an I/O memory window with
2289 relaxed memory access properties, then _mandatory_ memory barriers are
2290 required to enforce ordering.
2291
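A sketch of case (2), with an illustrative device whose data window is
mapped with relaxed (write-combining) attributes:

	memcpy_toio(dev->wc_window, buf, len);	/* relaxed-window stores */
	wmb();					/* mandatory write barrier */
	writel(DOORBELL_GO, dev->regs + DOORBELL);
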
2292See Documentation/DocBook/deviceiobook.tmpl for more information.
2293
2294
2295INTERRUPTS
2296----------
2297
2298A driver may be interrupted by its own interrupt service routine, and thus the
2299two parts of the driver may interfere with each other's attempts to control or
2300access the device.
2301
2302This may be alleviated - at least in part - by disabling local interrupts (a
2303form of locking), such that the critical operations are all contained within
2304the interrupt-disabled section in the driver. Whilst the driver's interrupt
2305routine is executing, the driver's core may not run on the same CPU, and its
2306interrupt is not permitted to happen again until the current interrupt has been
2307handled, thus the interrupt handler does not need to lock against that.
2308
2309However, consider a driver that was talking to an ethernet card that sports an
2310address register and a data register. If that driver's core talks to the card
2311under interrupt-disablement and then the driver's interrupt handler is invoked:
2312
2313 LOCAL IRQ DISABLE
	writew(3, ADDR);
	writew(y, DATA);
2316 LOCAL IRQ ENABLE
2317 <interrupt>
	writew(4, ADDR);
2319 q = readw(DATA);
2320 </interrupt>
2321
2322The store to the data register might happen after the second store to the
2323address register if ordering rules are sufficiently relaxed:
2324
2325 STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA
2326
2327
2328If ordering rules are relaxed, it must be assumed that accesses done inside an
2329interrupt disabled section may leak outside of it and may interleave with
2330accesses performed in an interrupt - and vice versa - unless implicit or
2331explicit barriers are used.
2332
2333Normally this won't be a problem because the I/O accesses done inside such
2334sections will include synchronous load operations on strictly ordered I/O
2335registers that form implicit I/O barriers. If this isn't sufficient then an
2336mmiowb() may need to be used explicitly.
2337
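In terms of the example above, that would mean:

	LOCAL IRQ DISABLE
	writew(3, ADDR);
	writew(y, DATA);
	mmiowb();	/* as suggested above, when the implicit ordering
			 * from synchronous reads is not available */
	LOCAL IRQ ENABLE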
2338
2339A similar situation may occur between an interrupt routine and two routines
2340running on separate CPUs that communicate with each other. If such a case is
2341likely, then interrupt-disabling locks should be used to guarantee ordering.
2342
2343
2344==========================
2345KERNEL I/O BARRIER EFFECTS
2346==========================
2347
2348When accessing I/O memory, drivers should use the appropriate accessor
2349functions:
2350
2351 (*) inX(), outX():
2352
2353 These are intended to talk to I/O space rather than memory space, but
2354 that's primarily a CPU-specific concept. The i386 and x86_64 processors do
2355 indeed have special I/O space access cycles and instructions, but many
2356 CPUs don't have such a concept.
2357
2358 The PCI bus, amongst others, defines an I/O space concept which - on such
2359 CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
2360 space. However, it may also be mapped as a virtual I/O space in the CPU's
2361 memory map, particularly on those CPUs that don't support alternate I/O
2362 spaces.
2363
2364 Accesses to this space may be fully synchronous (as on i386), but
2365 intermediary bridges (such as the PCI host bridge) may not fully honour
2366 that.
2367
2368 They are guaranteed to be fully ordered with respect to each other.
2369
2370 They are not guaranteed to be fully ordered with respect to other types of
2371 memory and I/O operation.
2372
2373 (*) readX(), writeX():
2374
2375 Whether these are guaranteed to be fully ordered and uncombined with
2376 respect to each other on the issuing CPU depends on the characteristics
2377 defined for the memory window through which they're accessing. On later
2378 i386 architecture machines, for example, this is controlled by way of the
2379 MTRR registers.
2380
81fc6323 2381 Ordinarily, these will be guaranteed to be fully ordered and uncombined,
2382 provided they're not accessing a prefetchable device.
2383
2384 However, intermediary hardware (such as a PCI bridge) may indulge in
2385 deferral if it so wishes; to flush a store, a load from the same location
2386 is preferred[*], but a load from the same device or from configuration
2387 space should suffice for PCI.
2388
2389 [*] NOTE! attempting to load from the same location as was written to may
2390 cause a malfunction - consider the 16550 Rx/Tx serial registers for
2391 example.
2392
2393 Used with prefetchable I/O memory, an mmiowb() barrier may be required to
2394 force stores to be ordered.
2395
2396 Please refer to the PCI specification for more information on interactions
2397 between PCI transactions.
2398
2399 (*) readX_relaxed()
2400
2401 These are similar to readX(), but are not guaranteed to be ordered in any
2402 way. Be aware that there is no I/O read barrier available.
2403
2404 (*) ioreadX(), iowriteX()
2405
     These will perform appropriately for the type of access they're actually
2407 doing, be it inX()/outX() or readX()/writeX().
2408
2409
2410========================================
2411ASSUMED MINIMUM EXECUTION ORDERING MODEL
2412========================================
2413
2414It has to be assumed that the conceptual CPU is weakly-ordered but that it will
2415maintain the appearance of program causality with respect to itself. Some CPUs
2416(such as i386 or x86_64) are more constrained than others (such as powerpc or
2417frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
2418of arch-specific code.
2419
2420This means that it must be considered that the CPU will execute its instruction
2421stream in any order it feels like - or even in parallel - provided that if an
instruction in the stream depends on an earlier instruction, then that
2423earlier instruction must be sufficiently complete[*] before the later
2424instruction may proceed; in other words: provided that the appearance of
2425causality is maintained.
2426
2427 [*] Some instructions have more than one effect - such as changing the
2428 condition codes, changing registers or changing memory - and different
2429 instructions may depend on different effects.
2430
2431A CPU may also discard any instruction sequence that winds up having no
2432ultimate effect. For example, if two adjacent instructions both load an
2433immediate value into the same register, the first may be discarded.
2434
2435
Similarly, it has to be assumed that the compiler might reorder the instruction
2437stream in any way it sees fit, again provided the appearance of causality is
2438maintained.
2439
2440
2441============================
2442THE EFFECTS OF THE CPU CACHE
2443============================
2444
2445The way cached memory operations are perceived across the system is affected to
2446a certain extent by the caches that lie between CPUs and memory, and by the
2447memory coherence system that maintains the consistency of state in the system.
2448
2449As far as the way a CPU interacts with another part of the system through the
2450caches goes, the memory system has to include the CPU's caches, and memory
2451barriers for the most part act at the interface between the CPU and its cache
2452(memory barriers logically act on the dotted line in the following diagram):
2453
2454 <--- CPU ---> : <----------- Memory ----------->
2455 :
2456 +--------+ +--------+ : +--------+ +-----------+
2457 | | | | : | | | | +--------+
2458 | CPU | | Memory | : | CPU | | | | |
2459 | Core |--->| Access |----->| Cache |<-->| | | |
108b42b4 2460 | | | Queue | : | | | |--->| Memory |
2461 | | | | : | | | | | |
2462 +--------+ +--------+ : +--------+ | | | |
2463 : | Cache | +--------+
2464 : | Coherency |
2465 : | Mechanism | +--------+
2466 +--------+ +--------+ : +--------+ | | | |
2467 | | | | : | | | | | |
2468 | CPU | | Memory | : | CPU | | |--->| Device |
2469 | Core |--->| Access |----->| Cache |<-->| | | |
2470 | | | Queue | : | | | | | |
2471 | | | | : | | | | +--------+
2472 +--------+ +--------+ : +--------+ +-----------+
2473 :
2474 :
2475
2476Although any particular load or store may not actually appear outside of the
2477CPU that issued it since it may have been satisfied within the CPU's own cache,
2478it will still appear as if the full memory access had taken place as far as the
2479other CPUs are concerned since the cache coherency mechanisms will migrate the
2480cacheline over to the accessing CPU and propagate the effects upon conflict.
2481
2482The CPU core may execute instructions in any order it deems fit, provided the
2483expected program causality appears to be maintained. Some of the instructions
2484generate load and store operations which then go into the queue of memory
2485accesses to be performed. The core may place these in the queue in any order
2486it wishes, and continue execution until it is forced to wait for an instruction
2487to complete.
2488
2489What memory barriers are concerned with is controlling the order in which
2490accesses cross from the CPU side of things to the memory side of things, and
2491the order in which the effects are perceived to happen by the other observers
2492in the system.
2493
2494[!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
2495their own loads and stores as if they had happened in program order.
2496
2497[!] MMIO or other device accesses may bypass the cache system. This depends on
2498the properties of the memory window through which devices are accessed and/or
2499the use of any special device communication instructions the CPU may have.
2500
2501
2502CACHE COHERENCY
2503---------------
2504
2505Life isn't quite as simple as it may appear above, however: for while the
2506caches are expected to be coherent, there's no guarantee that that coherency
2507will be ordered. This means that whilst changes made on one CPU will
2508eventually become visible on all CPUs, there's no guarantee that they will
2509become apparent in the same order on those other CPUs.
2510
2511
2512Consider dealing with a system that has a pair of CPUs (1 & 2), each of which
2513has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):
2514
2515 :
2516 : +--------+
2517 : +---------+ | |
2518 +--------+ : +--->| Cache A |<------->| |
2519 | | : | +---------+ | |
2520 | CPU 1 |<---+ | |
2521 | | : | +---------+ | |
2522 +--------+ : +--->| Cache B |<------->| |
2523 : +---------+ | |
2524 : | Memory |
2525 : +---------+ | System |
2526 +--------+ : +--->| Cache C |<------->| |
2527 | | : | +---------+ | |
2528 | CPU 2 |<---+ | |
2529 | | : | +---------+ | |
2530 +--------+ : +--->| Cache D |<------->| |
2531 : +---------+ | |
2532 : +--------+
2533 :
2534
2535Imagine the system has the following properties:
2536
2537 (*) an odd-numbered cache line may be in cache A, cache C or it may still be
2538 resident in memory;
2539
2540 (*) an even-numbered cache line may be in cache B, cache D or it may still be
2541 resident in memory;
2542
2543 (*) whilst the CPU core is interrogating one cache, the other cache may be
2544 making use of the bus to access the rest of the system - perhaps to
2545 displace a dirty cacheline or to do a speculative load;
2546
2547 (*) each cache has a queue of operations that need to be applied to that cache
2548 to maintain coherency with the rest of the system;
2549
2550 (*) the coherency queue is not flushed by normal loads to lines already
2551 present in the cache, even though the contents of the queue may
     potentially affect those loads.
2553
2554Imagine, then, that two writes are made on the first CPU, with a write barrier
2555between them to guarantee that they will appear to reach that CPU's caches in
2556the requisite order:
2557
2558 CPU 1 CPU 2 COMMENT
2559 =============== =============== =======================================
2560 u == 0, v == 1 and p == &u, q == &u
2561 v = 2;
	smp_wmb();			Make sure change to v is visible before
2563 change to p
2564 <A:modify v=2> v is now in cache A exclusively
2565 p = &v;
2566 <B:modify p=&v> p is now in cache B exclusively
2567
2568The write memory barrier forces the other CPUs in the system to perceive that
2569the local CPU's caches have apparently been updated in the correct order. But
now imagine that the second CPU wants to read those values:
2571
2572 CPU 1 CPU 2 COMMENT
2573 =============== =============== =======================================
2574 ...
2575 q = p;
2576 x = *q;
2577
The above pair of reads may then fail to happen in the expected order, as the
2579cacheline holding p may get updated in one of the second CPU's caches whilst
2580the update to the cacheline holding v is delayed in the other of the second
2581CPU's caches by some other cache event:
2582
2583 CPU 1 CPU 2 COMMENT
2584 =============== =============== =======================================
2585 u == 0, v == 1 and p == &u, q == &u
2586 v = 2;
2587 smp_wmb();
2588 <A:modify v=2> <C:busy>
2589 <C:queue v=2>
	p = &v;			q = p;
2591 <D:request p>
2592 <B:modify p=&v> <D:commit p=&v>
			<D:read p>
2594 x = *q;
2595 <C:read *q> Reads from v before v updated in cache
2596 <C:unbusy>
2597 <C:commit v=2>
2598
2599Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
2600no guarantee that, without intervention, the order of update will be the same
2601as that committed on CPU 1.
2602
2603
2604To intervene, we need to interpolate a data dependency barrier or a read
2605barrier between the loads. This will force the cache to commit its coherency
2606queue before processing any further requests:
2607
2608 CPU 1 CPU 2 COMMENT
2609 =============== =============== =======================================
2610 u == 0, v == 1 and p == &u, q == &u
2611 v = 2;
2612 smp_wmb();
2613 <A:modify v=2> <C:busy>
2614 <C:queue v=2>
	p = &v;			q = p;
2616 <D:request p>
2617 <B:modify p=&v> <D:commit p=&v>
			<D:read p>
2619 smp_read_barrier_depends()
2620 <C:unbusy>
2621 <C:commit v=2>
2622 x = *q;
2623 <C:read *q> Reads from v after v updated in cache
2624
2625
2626This sort of problem can be encountered on DEC Alpha processors as they have a
2627split cache that improves performance by making better use of the data bus.
2628Whilst most CPUs do imply a data dependency barrier on the read when a memory
2629access depends on a read, not all do, so it may not be relied on.
2630
2631Other CPUs may also have split caches, but must coordinate between the various
cachelets for normal memory accesses. The semantics of the Alpha removes the
need for coordination in the absence of memory barriers.
2634
2635
2636CACHE COHERENCY VS DMA
2637----------------------
2638
2639Not all systems maintain cache coherency with respect to devices doing DMA. In
2640such cases, a device attempting DMA may obtain stale data from RAM because
2641dirty cache lines may be resident in the caches of various CPUs, and may not
2642have been written back to RAM yet. To deal with this, the appropriate part of
2643the kernel must flush the overlapping bits of cache on each CPU (and maybe
2644invalidate them as well).
2645
2646In addition, the data DMA'd to RAM by a device may be overwritten by dirty
2647cache lines being written back to RAM from a CPU's cache after the device has
2648installed its own data, or cache lines present in the CPU's cache may simply
2649obscure the fact that RAM has been updated, until at such time as the cacheline
2650is discarded from the CPU's cache and reloaded. To deal with this, the
2651appropriate part of the kernel must invalidate the overlapping bits of the
2652cache on each CPU.
2653
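The streaming DMA API performs the flushing and invalidation described above
on behalf of the driver on non-coherent systems; a minimal sketch (the device
and buffer are illustrative):

	dma_addr_t handle;

	handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	/* ... point the device at 'handle' and let it run ... */
	dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
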
2654See Documentation/cachetlb.txt for more information on cache management.
2655
2656
2657CACHE COHERENCY VS MMIO
2658-----------------------
2659
2660Memory mapped I/O usually takes place through memory locations that are part of
a window in the CPU's memory space that has different properties assigned than
2662the usual RAM directed window.
2663
2664Amongst these properties is usually the fact that such accesses bypass the
2665caching entirely and go directly to the device buses. This means MMIO accesses
2666may, in effect, overtake accesses to cached memory that were emitted earlier.
2667A memory barrier isn't sufficient in such a case, but rather the cache must be
2668flushed between the cached memory write and the MMIO access if the two are in
2669any way dependent.
2670
2671
2672=========================
2673THE THINGS CPUS GET UP TO
2674=========================
2675
2676A programmer might take it for granted that the CPU will perform memory
operations in exactly the order specified, so that if the CPU is, for example,
2678given the following piece of code to execute:
2679
2680 a = ACCESS_ONCE(*A);
2681 ACCESS_ONCE(*B) = b;
2682 c = ACCESS_ONCE(*C);
2683 d = ACCESS_ONCE(*D);
2684 ACCESS_ONCE(*E) = e;
108b42b4 2685
they would then expect that the CPU will complete the memory operation for each
2687instruction before moving on to the next one, leading to a definite sequence of
2688operations as seen by external observers in the system:
2689
2690 LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.
2691
2692
2693Reality is, of course, much messier. With many CPUs and compilers, the above
2694assumption doesn't hold because:
2695
2696 (*) loads are more likely to need to be completed immediately to permit
2697 execution progress, whereas stores can often be deferred without a
2698 problem;
2699
2700 (*) loads may be done speculatively, and the result discarded should it prove
2701 to have been unnecessary;
2702
2703 (*) loads may be done speculatively, leading to the result having been fetched
2704 at the wrong time in the expected sequence of events;
2705
2706 (*) the order of the memory accesses may be rearranged to promote better use
2707 of the CPU buses and caches;
2708
2709 (*) loads and stores may be combined to improve performance when talking to
2710 memory or I/O hardware that can do batched accesses of adjacent locations,
2711 thus cutting down on transaction setup costs (memory and PCI devices may
2712 both be able to do this); and
2713
2714 (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
2715 mechanisms may alleviate this - once the store has actually hit the cache
2716 - there's no guarantee that the coherency management will be propagated in
2717 order to other CPUs.
2718
2719So what another CPU, say, might actually observe from the above piece of code
2720is:
2721
2722 LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B
2723
2724 (Where "LOAD {*C,*D}" is a combined load)
2725
2726
2727However, it is guaranteed that a CPU will be self-consistent: it will see its
2728_own_ accesses appear to be correctly ordered, without the need for a memory
2729barrier. For instance with the following code:
2730
2731 U = ACCESS_ONCE(*A);
2732 ACCESS_ONCE(*A) = V;
2733 ACCESS_ONCE(*A) = W;
2734 X = ACCESS_ONCE(*A);
2735 ACCESS_ONCE(*A) = Y;
2736 Z = ACCESS_ONCE(*A);
2737
2738and assuming no intervention by an external influence, it can be assumed that
2739the final result will appear to be:
2740
2741 U == the original value of *A
2742 X == W
2743 Z == Y
2744 *A == Y
2745
2746The code above may cause the CPU to generate the full sequence of memory
2747accesses:
2748
2749 U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A
2750
2751in that order, but, without intervention, the sequence may have almost any
2752combination of elements combined or discarded, provided the program's view of
2753the world remains consistent. Note that ACCESS_ONCE() is -not- optional
2754in the above example, as there are architectures where a given CPU might
2755interchange successive loads to the same location. On such architectures,
2756ACCESS_ONCE() does whatever is necessary to prevent this, for example, on
2757Itanium the volatile casts used by ACCESS_ONCE() cause GCC to emit the
2758special ld.acq and st.rel instructions that prevent such reordering.
2759
2760The compiler may also combine, discard or defer elements of the sequence before
2761the CPU even sees them.
2762
2763For instance:
2764
2765 *A = V;
2766 *A = W;
2767
2768may be reduced to:
2769
2770 *A = W;
2771
2772since, without either a write barrier or an ACCESS_ONCE(), it can be
2773assumed that the effect of the storage of V to *A is lost. Similarly:
2774
2775 *A = Y;
2776 Z = *A;
2777
may, without a memory barrier or an ACCESS_ONCE(), be reduced to:
2779
2780 *A = Y;
2781 Z = Y;
2782
2783and the LOAD operation never appear outside of the CPU.
2784
2785
2786AND THEN THERE'S THE ALPHA
2787--------------------------
2788
2789The DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that,
2790some versions of the Alpha CPU have a split data cache, permitting them to have
two semantically-related cache lines updated at separate times. This is where
2792the data dependency barrier really becomes necessary as this synchronises both
2793caches with the memory coherence system, thus making it seem like pointer
2794changes vs new data occur in the right order.
2795
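The classic case is consuming a pointer published by another CPU (a sketch
with illustrative names; gp is assumed to be initialised elsewhere):

	p = ACCESS_ONCE(gp);
	smp_read_barrier_depends();	/* required for DEC Alpha */
	d = p->data;
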
The Alpha defines the Linux kernel's memory barrier model.
2797
2798See the subsection on "Cache Coherency" above.
2799
2800
2801============
2802EXAMPLE USES
2803============
2804
2805CIRCULAR BUFFERS
2806----------------
2807
2808Memory barriers can be used to implement circular buffering without the need
2809of a lock to serialise the producer with the consumer. See:
2810
2811 Documentation/circular-buffers.txt
2812
2813for details.
2814
2815
2816==========
2817REFERENCES
2818==========
2819
2820Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek,
2821Digital Press)
2822 Chapter 5.2: Physical Address Space Characteristics
2823 Chapter 5.4: Caches and Write Buffers
2824 Chapter 5.5: Data Sharing
2825 Chapter 5.6: Read/Write Ordering
2826
2827AMD64 Architecture Programmer's Manual Volume 2: System Programming
2828 Chapter 7.1: Memory-Access Ordering
2829 Chapter 7.4: Buffering and Combining Memory Writes
2830
2831IA-32 Intel Architecture Software Developer's Manual, Volume 3:
2832System Programming Guide
2833 Chapter 7.1: Locked Atomic Operations
2834 Chapter 7.2: Memory Ordering
2835 Chapter 7.4: Serializing Instructions
2836
2837The SPARC Architecture Manual, Version 9
2838 Chapter 8: Memory Models
2839 Appendix D: Formal Specification of the Memory Models
2840 Appendix J: Programming with the Memory Models
2841
2842UltraSPARC Programmer Reference Manual
2843 Chapter 5: Memory Accesses and Cacheability
2844 Chapter 15: Sparc-V9 Memory Models
2845
2846UltraSPARC III Cu User's Manual
2847 Chapter 9: Memory Models
2848
2849UltraSPARC IIIi Processor User's Manual
2850 Chapter 8: Memory Models
2851
2852UltraSPARC Architecture 2005
2853 Chapter 9: Memory
2854 Appendix D: Formal Specifications of the Memory Models
2855
2856UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
2857 Chapter 8: Memory Models
2858 Appendix F: Caches and Cache Coherency
2859
2860Solaris Internals, Core Kernel Architecture, p63-68:
2861 Chapter 3.3: Hardware Considerations for Locks and
2862 Synchronization
2863
2864Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
2865for Kernel Programmers:
2866 Chapter 13: Other Memory Models
2867
2868Intel Itanium Architecture Software Developer's Manual: Volume 1:
2869 Section 2.6: Speculation
2870 Section 4.4: Memory Access