]>
Commit | Line | Data |
---|---|---|
5444e768 PB |
1 | CPUs perform independent memory operations effectively in random order. |
2 | but this can be a problem for CPU-CPU interaction (including interactions | |
3 | between QEMU and the guest). Multi-threaded programs use various tools | |
4 | to instruct the compiler and the CPU to restrict the order to something | |
5 | that is consistent with the expectations of the programmer. | |
6 | ||
7 | The most basic tool is locking. Mutexes, condition variables and | |
8 | semaphores are used in QEMU, and should be the default approach to | |
9 | synchronization. Anything else is considerably harder, but it's | |
10 | also justified more often than one would like. The two tools that | |
11 | are provided by qemu/atomic.h are memory barriers and atomic operations. | |
12 | ||
13 | Macros defined by qemu/atomic.h fall in three camps: | |
14 | ||
15 | - compiler barriers: barrier(); | |
16 | ||
17 | - weak atomic access and manual memory barriers: atomic_read(), | |
f1ee8696 PB |
18 | atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_mb_acquire(), |
19 | smp_mb_release(), smp_read_barrier_depends(); | |
5444e768 PB |
20 | |
21 | - sequentially consistent atomic access: everything else. | |
22 | ||
23 | ||
24 | COMPILER MEMORY BARRIER | |
25 | ======================= | |
26 | ||
27 | barrier() prevents the compiler from moving the memory accesses either | |
28 | side of it to the other side. The compiler barrier has no direct effect | |
29 | on the CPU, which may then reorder things however it wishes. | |
30 | ||
31 | barrier() is mostly used within qemu/atomic.h itself. On some | |
32 | architectures, CPU guarantees are strong enough that blocking compiler | |
33 | optimizations already ensures the correct order of execution. In this | |
34 | case, qemu/atomic.h will reduce stronger memory barriers to simple | |
35 | compiler barriers. | |
36 | ||
37 | Still, barrier() can be useful when writing code that can be interrupted | |
38 | by signal handlers. | |
39 | ||
40 | ||
41 | SEQUENTIALLY CONSISTENT ATOMIC ACCESS | |
42 | ===================================== | |
43 | ||
44 | Most of the operations in the qemu/atomic.h header ensure *sequential | |
45 | consistency*, where "the result of any execution is the same as if the | |
46 | operations of all the processors were executed in some sequential order, | |
47 | and the operations of each individual processor appear in this sequence | |
48 | in the order specified by its program". | |
49 | ||
50 | qemu/atomic.h provides the following set of atomic read-modify-write | |
51 | operations: | |
52 | ||
53 | void atomic_inc(ptr) | |
54 | void atomic_dec(ptr) | |
55 | void atomic_add(ptr, val) | |
56 | void atomic_sub(ptr, val) | |
57 | void atomic_and(ptr, val) | |
58 | void atomic_or(ptr, val) | |
59 | ||
60 | typeof(*ptr) atomic_fetch_inc(ptr) | |
61 | typeof(*ptr) atomic_fetch_dec(ptr) | |
62 | typeof(*ptr) atomic_fetch_add(ptr, val) | |
63 | typeof(*ptr) atomic_fetch_sub(ptr, val) | |
64 | typeof(*ptr) atomic_fetch_and(ptr, val) | |
65 | typeof(*ptr) atomic_fetch_or(ptr, val) | |
db81b995 | 66 | typeof(*ptr) atomic_fetch_xor(ptr, val) |
447b0d0b | 67 | typeof(*ptr) atomic_fetch_inc_nonzero(ptr) |
dfc007f7 | 68 | typeof(*ptr) atomic_xchg(ptr, val) |
5444e768 PB |
69 | typeof(*ptr) atomic_cmpxchg(ptr, old, new) |
70 | ||
71 | all of which return the old value of *ptr. These operations are | |
db81b995 PB |
72 | polymorphic; they operate on any type that is as wide as a pointer. |
73 | ||
74 | Similar operations return the new value of *ptr: | |
75 | ||
76 | typeof(*ptr) atomic_inc_fetch(ptr) | |
77 | typeof(*ptr) atomic_dec_fetch(ptr) | |
78 | typeof(*ptr) atomic_add_fetch(ptr, val) | |
79 | typeof(*ptr) atomic_sub_fetch(ptr, val) | |
80 | typeof(*ptr) atomic_and_fetch(ptr, val) | |
81 | typeof(*ptr) atomic_or_fetch(ptr, val) | |
82 | typeof(*ptr) atomic_xor_fetch(ptr, val) | |
5444e768 PB |
83 | |
84 | Sequentially consistent loads and stores can be done using: | |
85 | ||
86 | atomic_fetch_add(ptr, 0) for loads | |
87 | atomic_xchg(ptr, val) for stores | |
88 | ||
89 | However, they are quite expensive on some platforms, notably POWER and | |
90 | ARM. Therefore, qemu/atomic.h provides two primitives with slightly | |
91 | weaker constraints: | |
92 | ||
93 | typeof(*ptr) atomic_mb_read(ptr) | |
94 | void atomic_mb_set(ptr, val) | |
95 | ||
96 | The semantics of these primitives map to Java volatile variables, | |
97 | and are strongly related to memory barriers as used in the Linux | |
98 | kernel (see below). | |
99 | ||
100 | As long as you use atomic_mb_read and atomic_mb_set, accesses cannot | |
101 | be reordered with each other, and it is also not possible to reorder | |
102 | "normal" accesses around them. | |
103 | ||
104 | However, and this is the important difference between | |
105 | atomic_mb_read/atomic_mb_set and sequential consistency, it is important | |
106 | for both threads to access the same volatile variable. It is not the | |
107 | case that everything visible to thread A when it writes volatile field f | |
108 | becomes visible to thread B after it reads volatile field g. The store | |
109 | and load have to "match" (i.e., be performed on the same volatile | |
110 | field) to achieve the right semantics. | |
111 | ||
112 | ||
113 | These operations operate on any type that is as wide as an int or smaller. | |
114 | ||
115 | ||
116 | WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS | |
117 | ============================================= | |
118 | ||
119 | Compared to sequentially consistent atomic access, programming with | |
120 | weaker consistency models can be considerably more complicated. | |
121 | In general, if the algorithm you are writing includes both writes | |
122 | and reads on the same side, it is generally simpler to use sequentially | |
123 | consistent primitives. | |
124 | ||
729c0ddd PB |
125 | When using this model, variables are accessed with: |
126 | ||
127 | - atomic_read() and atomic_set(); these prevent the compiler from | |
128 | optimizing accesses out of existence and creating unsolicited | |
129 | accesses, but do not otherwise impose any ordering on loads and | |
130 | stores: both the compiler and the processor are free to reorder | |
131 | them. | |
132 | ||
133 | - atomic_load_acquire(), which guarantees the LOAD to appear to | |
134 | happen, with respect to the other components of the system, | |
135 | before all the LOAD or STORE operations specified afterwards. | |
136 | Operations coming before atomic_load_acquire() can still be | |
137 | reordered after it. | |
138 | ||
139 | - atomic_store_release(), which guarantees the STORE to appear to | |
140 | happen, with respect to the other components of the system, | |
141 | after all the LOAD or STORE operations specified afterwards. | |
142 | Operations coming after atomic_store_release() can still be | |
143 | reordered after it. | |
144 | ||
145 | Restrictions to the ordering of accesses can also be specified | |
f1ee8696 PB |
146 | using the memory barrier macros: smp_rmb(), smp_wmb(), smp_mb(), |
147 | smp_mb_acquire(), smp_mb_release(), smp_read_barrier_depends(). | |
5444e768 | 148 | |
5444e768 | 149 | Memory barriers control the order of references to shared memory. |
f1ee8696 | 150 | They come in six kinds: |
5444e768 PB |
151 | |
152 | - smp_rmb() guarantees that all the LOAD operations specified before | |
153 | the barrier will appear to happen before all the LOAD operations | |
154 | specified after the barrier with respect to the other components of | |
155 | the system. | |
156 | ||
157 | In other words, smp_rmb() puts a partial ordering on loads, but is not | |
158 | required to have any effect on stores. | |
159 | ||
160 | - smp_wmb() guarantees that all the STORE operations specified before | |
161 | the barrier will appear to happen before all the STORE operations | |
162 | specified after the barrier with respect to the other components of | |
163 | the system. | |
164 | ||
165 | In other words, smp_wmb() puts a partial ordering on stores, but is not | |
166 | required to have any effect on loads. | |
167 | ||
f1ee8696 PB |
168 | - smp_mb_acquire() guarantees that all the LOAD operations specified before |
169 | the barrier will appear to happen before all the LOAD or STORE operations | |
170 | specified after the barrier with respect to the other components of | |
171 | the system. | |
172 | ||
173 | - smp_mb_release() guarantees that all the STORE operations specified *after* | |
174 | the barrier will appear to happen after all the LOAD or STORE operations | |
175 | specified *before* the barrier with respect to the other components of | |
176 | the system. | |
177 | ||
5444e768 PB |
178 | - smp_mb() guarantees that all the LOAD and STORE operations specified |
179 | before the barrier will appear to happen before all the LOAD and | |
180 | STORE operations specified after the barrier with respect to the other | |
181 | components of the system. | |
182 | ||
183 | smp_mb() puts a partial ordering on both loads and stores. It is | |
184 | stronger than both a read and a write memory barrier; it implies both | |
f1ee8696 PB |
185 | smp_mb_acquire() and smp_mb_release(), but it also prevents STOREs |
186 | coming before the barrier from overtaking LOADs coming after the | |
187 | barrier and vice versa. | |
5444e768 PB |
188 | |
189 | - smp_read_barrier_depends() is a weaker kind of read barrier. On | |
190 | most processors, whenever two loads are performed such that the | |
191 | second depends on the result of the first (e.g., the first load | |
192 | retrieves the address to which the second load will be directed), | |
193 | the processor will guarantee that the first LOAD will appear to happen | |
194 | before the second with respect to the other components of the system. | |
195 | However, this is not always true---for example, it was not true on | |
196 | Alpha processors. Whenever this kind of access happens to shared | |
197 | memory (that is not protected by a lock), a read barrier is needed, | |
198 | and smp_read_barrier_depends() can be used instead of smp_rmb(). | |
199 | ||
200 | Note that the first load really has to have a _data_ dependency and not | |
201 | a control dependency. If the address for the second load is dependent | |
202 | on the first load, but the dependency is through a conditional rather | |
203 | than actually loading the address itself, then it's a _control_ | |
204 | dependency and a full read barrier or better is required. | |
205 | ||
206 | ||
207 | This is the set of barriers that is required *between* two atomic_read() | |
208 | and atomic_set() operations to achieve sequential consistency: | |
209 | ||
f1ee8696 PB |
210 | | 2nd operation | |
211 | |-----------------------------------------------| | |
212 | 1st operation | (after last) | atomic_read | atomic_set | | |
213 | ---------------+----------------+-------------+----------------| | |
214 | (before first) | | none | smp_mb_release | | |
215 | ---------------+----------------+-------------+----------------| | |
216 | atomic_read | smp_mb_acquire | smp_rmb | ** | | |
217 | ---------------+----------------+-------------+----------------| | |
218 | atomic_set | none | smp_mb()*** | smp_wmb() | | |
219 | ---------------+----------------+-------------+----------------| | |
5444e768 PB |
220 | |
221 | * Or smp_read_barrier_depends(). | |
222 | ||
f1ee8696 PB |
223 | ** This requires a load-store barrier. This is achieved by |
224 | either smp_mb_acquire() or smp_mb_release(). | |
5444e768 PB |
225 | |
226 | *** This requires a store-load barrier. On most machines, the only | |
227 | way to achieve this is a full barrier. | |
228 | ||
229 | ||
230 | You can see that the two possible definitions of atomic_mb_read() | |
231 | and atomic_mb_set() are the following: | |
232 | ||
f1ee8696 PB |
233 | 1) atomic_mb_read(p) = atomic_read(p); smp_mb_acquire() |
234 | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb() | |
5444e768 | 235 | |
f1ee8696 PB |
236 | 2) atomic_mb_read(p) = smp_mb() atomic_read(p); smp_mb_acquire() |
237 | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); | |
5444e768 PB |
238 | |
239 | Usually the former is used, because smp_mb() is expensive and a program | |
240 | normally has more reads than writes. Therefore it makes more sense to | |
241 | make atomic_mb_set() the more expensive operation. | |
242 | ||
243 | There are two common cases in which atomic_mb_read and atomic_mb_set | |
244 | generate too many memory barriers, and thus it can be useful to manually | |
729c0ddd | 245 | place barriers, or use atomic_load_acquire/atomic_store_release instead: |
5444e768 PB |
246 | |
247 | - when a data structure has one thread that is always a writer | |
248 | and one thread that is always a reader, manual placement of | |
249 | memory barriers makes the write side faster. Furthermore, | |
250 | correctness is easy to check for in this case using the "pairing" | |
251 | trick that is explained below: | |
252 | ||
253 | thread 1 thread 1 | |
254 | ------------------------- ------------------------ | |
255 | (other writes) | |
729c0ddd PB |
256 | atomic_mb_set(&a, x) atomic_store_release(&a, x) |
257 | atomic_mb_set(&b, y) atomic_store_release(&b, y) | |
5444e768 PB |
258 | |
259 | => | |
260 | thread 2 thread 2 | |
261 | ------------------------- ------------------------ | |
729c0ddd PB |
262 | y = atomic_mb_read(&b) y = atomic_load_acquire(&b) |
263 | x = atomic_mb_read(&a) x = atomic_load_acquire(&a) | |
264 | (other reads) | |
f1ee8696 PB |
265 | |
266 | Note that the barrier between the stores in thread 1, and between | |
267 | the loads in thread 2, has been optimized here to a write or a | |
268 | read memory barrier respectively. On some architectures, notably | |
269 | ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as | |
270 | smp_mb, but smp_rmb and/or smp_wmb are more efficient. | |
5444e768 PB |
271 | |
272 | - sometimes, a thread is accessing many variables that are otherwise | |
273 | unrelated to each other (for example because, apart from the current | |
274 | thread, exactly one other thread will read or write each of these | |
275 | variables). In this case, it is possible to "hoist" the implicit | |
276 | barriers provided by atomic_mb_read() and atomic_mb_set() outside | |
277 | a loop. For example, the above definition atomic_mb_read() gives | |
278 | the following transformation: | |
279 | ||
280 | n = 0; n = 0; | |
281 | for (i = 0; i < 10; i++) => for (i = 0; i < 10; i++) | |
282 | n += atomic_mb_read(&a[i]); n += atomic_read(&a[i]); | |
f1ee8696 | 283 | smp_mb_acquire(); |
5444e768 PB |
284 | |
285 | Similarly, atomic_mb_set() can be transformed as follows: | |
5444e768 | 286 | |
f1ee8696 | 287 | smp_mb_release(); |
5444e768 PB |
288 | for (i = 0; i < 10; i++) => for (i = 0; i < 10; i++) |
289 | atomic_mb_set(&a[i], false); atomic_set(&a[i], false); | |
290 | smp_mb(); | |
291 | ||
292 | ||
729c0ddd PB |
293 | The other thread can still use atomic_mb_read()/atomic_mb_set(). |
294 | ||
5444e768 PB |
295 | The two tricks can be combined. In this case, splitting a loop in |
296 | two lets you hoist the barriers out of the loops _and_ eliminate the | |
297 | expensive smp_mb(): | |
298 | ||
f1ee8696 | 299 | smp_mb_release(); |
5444e768 PB |
300 | for (i = 0; i < 10; i++) { => for (i = 0; i < 10; i++) |
301 | atomic_mb_set(&a[i], false); atomic_set(&a[i], false); | |
302 | atomic_mb_set(&b[i], false); smb_wmb(); | |
303 | } for (i = 0; i < 10; i++) | |
304 | atomic_set(&a[i], false); | |
305 | smp_mb(); | |
306 | ||
5444e768 PB |
307 | |
308 | Memory barrier pairing | |
309 | ---------------------- | |
310 | ||
311 | A useful rule of thumb is that memory barriers should always, or almost | |
312 | always, be paired with another barrier. In the case of QEMU, however, | |
313 | note that the other barrier may actually be in a driver that runs in | |
314 | the guest! | |
315 | ||
316 | For the purposes of pairing, smp_read_barrier_depends() and smp_rmb() | |
d3e4abdd | 317 | both count as read barriers. A read barrier shall pair with a write |
5444e768 PB |
318 | barrier or a full barrier; a write barrier shall pair with a read |
319 | barrier or a full barrier. A full barrier can pair with anything. | |
320 | For example: | |
321 | ||
322 | thread 1 thread 2 | |
323 | =============== =============== | |
324 | a = 1; | |
325 | smp_wmb(); | |
326 | b = 2; x = b; | |
327 | smp_rmb(); | |
328 | y = a; | |
329 | ||
d3e4abdd | 330 | Note that the "writing" thread is accessing the variables in the |
5444e768 PB |
331 | opposite order as the "reading" thread. This is expected: stores |
332 | before the write barrier will normally match the loads after the | |
333 | read barrier, and vice versa. The same is true for more than 2 | |
334 | access and for data dependency barriers: | |
335 | ||
336 | thread 1 thread 2 | |
337 | =============== =============== | |
338 | b[2] = 1; | |
339 | smp_wmb(); | |
340 | x->i = 2; | |
341 | smp_wmb(); | |
342 | a = x; x = a; | |
343 | smp_read_barrier_depends(); | |
344 | y = x->i; | |
345 | smp_read_barrier_depends(); | |
346 | z = b[y]; | |
347 | ||
f1ee8696 PB |
348 | smp_wmb() also pairs with atomic_mb_read() and smp_mb_acquire(). |
349 | and smp_rmb() also pairs with atomic_mb_set() and smp_mb_release(). | |
5444e768 PB |
350 | |
351 | ||
352 | COMPARISON WITH LINUX KERNEL MEMORY BARRIERS | |
353 | ============================================ | |
354 | ||
355 | Here is a list of differences between Linux kernel atomic operations | |
356 | and memory barriers, and the equivalents in QEMU: | |
357 | ||
358 | - atomic operations in Linux are always on a 32-bit int type and | |
359 | use a boxed atomic_t type; atomic operations in QEMU are polymorphic | |
360 | and use normal C types. | |
361 | ||
56ebe022 EC |
362 | - Originally, atomic_read and atomic_set in Linux gave no guarantee |
363 | at all. Linux 4.1 updated them to implement volatile | |
364 | semantics via ACCESS_ONCE (or the more recent READ/WRITE_ONCE). | |
365 | ||
366 | QEMU's atomic_read/set implement, if the compiler supports it, C11 | |
367 | atomic relaxed semantics, and volatile semantics otherwise. | |
368 | Both semantics prevent the compiler from doing certain transformations; | |
369 | the difference is that atomic accesses are guaranteed to be atomic, | |
370 | while volatile accesses aren't. Thus, in the volatile case we just cross | |
371 | our fingers hoping that the compiler will generate atomic accesses, | |
372 | since we assume the variables passed are machine-word sized and | |
373 | properly aligned. | |
374 | No barriers are implied by atomic_read/set in either Linux or QEMU. | |
5444e768 | 375 | |
a4a0e4b2 PB |
376 | - atomic read-modify-write operations in Linux are of three kinds: |
377 | ||
378 | atomic_OP returns void | |
379 | atomic_OP_return returns new value of the variable | |
380 | atomic_fetch_OP returns the old value of the variable | |
381 | atomic_cmpxchg returns the old value of the variable | |
382 | ||
383 | In QEMU, the second kind does not exist. Currently Linux has | |
384 | atomic_fetch_or only. QEMU provides and, or, inc, dec, add, sub. | |
5444e768 PB |
385 | |
386 | - different atomic read-modify-write operations in Linux imply | |
387 | a different set of memory barriers; in QEMU, all of them enforce | |
388 | sequential consistency, which means they imply full memory barriers | |
389 | before and after the operation. | |
390 | ||
a4a0e4b2 PB |
391 | - Linux does not have an equivalent of atomic_mb_set(). In particular, |
392 | note that smp_store_mb() is a little weaker than atomic_mb_set(). | |
393 | atomic_mb_read() compiles to the same instructions as Linux's | |
394 | smp_load_acquire(), but this should be treated as an implementation | |
729c0ddd | 395 | detail. |
5444e768 PB |
396 | |
397 | SOURCES | |
398 | ======= | |
399 | ||
400 | * Documentation/memory-barriers.txt from the Linux kernel | |
401 | ||
402 | * "The JSR-133 Cookbook for Compiler Writers", available at | |
403 | http://g.oswego.edu/dl/jmm/cookbook.html |