This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or where only one memory location is accessed, and
the second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.	Some aspects of the C language are unordered.  For example,
	in the expression "f(x) + g(y)", the order in which f and g are
	called is not specified; the object code is allowed to use either
	order or even to interleave the computations.

2.	Compilers are permitted to use the "as-if" rule.  That is, a
	compiler can emit whatever code it likes for normal accesses,
	as long as the results of a single-threaded execution appear
	just as if the compiler had followed all the relevant rules.
	To see this, compile with a high level of optimization and run
	the debugger on the resulting binary.

3.	If there is only one variable but multiple CPUs, that variable
	must be properly aligned and all accesses to that variable must
	be full sized.  Variables that straddle cachelines or pages void
	your full-ordering warranty, as do undersized accesses that load
	from or store to only part of the variable.

4.	If there are multiple CPUs, accesses to shared variables should
	use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
	tearing, load/store fusing, and invented loads and stores (see
	the sketch following this list).  There are exceptions to this
	rule, including:

	i.	When there is no possibility of a given shared variable
		being updated by some other CPU, for example, while
		holding the update-side lock, reads from that variable
		need not use READ_ONCE().

	ii.	When there is no possibility of a given shared variable
		being either read or updated by other CPUs, for example,
		when running during early boot, reads from that variable
		need not use READ_ONCE() and writes to that variable
		need not use WRITE_ONCE().

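For illustration, here is a minimal sketch, not taken from the Linux
kernel, of the common case in which both CPUs use marked accesses to a
shared variable; the variable and function names are made up:

	int x;

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);  /* Marked store: no tearing, fusing, or invented stores. */
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(x); /* Marked load: no tearing, fusing, or invented loads. */
	}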

Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

	/* See MP+polocks.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		spin_lock(&mylock);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		spin_unlock(&mylock);
		r1 = READ_ONCE(x);
	}

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

	/* See MP+porevlocks.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(y);
		spin_lock(&mylock);
		r1 = READ_ONCE(x);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		spin_unlock(&mylock);
		WRITE_ONCE(y, 1);
	}

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

	/* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

	/* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		smp_mb__after_spinlock();
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.

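For reference, here is a sketch of the unordered version of this pattern,
along the lines of the MP+poonceonces.litmus test just mentioned; nothing
here prevents the counter-intuitive outcome:

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		r1 = READ_ONCE(x);
	}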

Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

	/* See MP+pooncerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		r1 = READ_ONCE(x);
	}

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack.  Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

	/* See MP+onceassign+derefonce.litmus. */
	int z;
	int *y = &z;
	int x;

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		rcu_assign_pointer(y, &x);
	}

	void CPU1(void)
	{
		rcu_read_lock();
		r0 = rcu_dereference(y);
		r1 = READ_ONCE(*r0);
		rcu_read_unlock();
	}

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
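
To make these three cases concrete, here are minimal sketches, not taken
from the kernel tree and using made-up variable names, of an address
dependency, a data dependency, and a control dependency:

	/* Address dependency: the loaded value is the address of a later access. */
	r0 = rcu_dereference(y);
	r1 = READ_ONCE(*r0);

	/* Data dependency: the loaded value feeds the value written by a later store. */
	r0 = READ_ONCE(x);
	WRITE_ONCE(z, r0 + 1);

	/* Control dependency: the loaded value decides whether a later store executes. */
	r0 = READ_ONCE(x);
	if (r0)
		WRITE_ONCE(z, 1);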

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

	/* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		smp_rmb();
		r1 = READ_ONCE(x);
	}

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

	log->l_curr_block -= log->l_logBBsize;
	ASSERT(log->l_curr_block >= 0);
	smp_wmb();
	log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

	cur_cycle = READ_ONCE(log->l_curr_cycle);
	smp_rmb();
	cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

	 *   kernel				user
	 *
	 *   if (LOAD ->data_tail) {		LOAD ->data_head
	 *			(A)		smp_rmb()	(C)
	 *	STORE $data			LOAD $data
	 *	smp_wmb()	(B)		smp_mb()	(D)
	 *	STORE ->data_head		STORE ->data_tail
	 *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.
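
For example, a variant of the litmus test above, shown here only as a
sketch rather than as a test from the kernel tree, remains correct with
smp_mb() substituted for both of the weaker barriers:

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_mb();	/* Stronger than the smp_wmb() used above. */
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		smp_mb();	/* Stronger than the smp_rmb() used above. */
		r1 = READ_ONCE(x);
	}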


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.
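
For reference, here is a sketch of that unordered pattern, along the
lines of the LB+poonceonces.litmus test just mentioned; nothing here
prevents both r0 and r1 from ending up equal to 1:

	void CPU0(void)
	{
		r0 = READ_ONCE(x);
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r1 = READ_ONCE(y);
		WRITE_ONCE(x, 1);
	}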

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

	/* See LB+fencembonceonce+ctrlonceonce.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(x);
		if (r0)
			WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r1 = READ_ONCE(y);
		smp_mb();
		WRITE_ONCE(x, 1);
	}

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

	 *   kernel				user
	 *
	 *   if (LOAD ->data_tail) {		LOAD ->data_head
	 *			(A)		smp_rmb()	(C)
	 *	STORE $data			LOAD $data
	 *	smp_wmb()	(B)		smp_mb()	(D)
	 *	STORE ->data_head		STORE ->data_tail
	 *   }
	 *
	 * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data, combined with the user's full memory barrier
between the load from data and the store to ->data_tail, prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

	/* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		r1 = smp_load_acquire(&z);
		r2 = READ_ONCE(x);
	}

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation were replaced with READ_ONCE().
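
For reference, here is a sketch of that weakened variant of CPU1(); this
exact fragment does not itself appear in the kernel tree:

	void CPU1(void)
	{
		r0 = READ_ONCE(y);		/* Was smp_load_acquire(&y). */
		smp_store_release(&z, 1);
	}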

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

	/* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

	/* See SB+fencembonceonces.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_mb();
		r0 = READ_ONCE(y);
	}

	void CPU1(void)
	{
		WRITE_ONCE(y, 1);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

	 *      CPU0 - waker                    CPU1 - waiter
	 *
	 *                                      for (;;) {
	 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
	 *      smp_mb();                         // smp_mb() from set_current_state()
	 *      if (waitqueue_active(wq_head))         if (@cond)
	 *        wake_up(wq_head);                      break;
	 *                                        schedule();
	 *                                      }
	 *                                      finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.
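
As a rough sketch, simplified from the comment above and with the cond
flag and the wq_head pointer standing in for the real data structures,
the waker side might be written as:

	void waker(void)
	{
		WRITE_ONCE(cond, true);
		smp_mb();	/* Order the store to cond before the waitqueue_active() check. */
		if (waitqueue_active(wq_head))
			wake_up(wq_head);
	}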

Note that use of locking can greatly simplify this pattern.


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.	Write-to-read, where the next CPU reads the value that the
	previous CPU wrote.  The LB litmus-test patterns contain only
	this type of relation.  In formal memory-modeling texts, this
	relation is called "reads-from" and is usually abbreviated "rf".

2.	Read-to-write, where the next CPU overwrites the value that the
	previous CPU read.  The SB litmus test contains only this type
	of relation.  In formal memory-modeling texts, this relation is
	often called "from-reads" and is sometimes abbreviated "fr".

3.	Write-to-write, where the next CPU overwrites the value written
	by the previous CPU.  The Z6.0 litmus test pattern contains a
	write-to-write relation between the last access of CPU1() and
	the first access of CPU2().  In formal memory-modeling texts,
	this relation is often called "coherence order" and is sometimes
	abbreviated "co".  In the C++ standard, it is instead called
	"modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o	If all links are write-to-read links, then the weakest
	possible ordering within each CPU suffices.  For example, in
	the LB litmus test, a control dependency was enough to do the
	job.

o	If all but one of the links are write-to-read links, then a
	release-acquire chain suffices.  Both the MP and the ISA2
	litmus tests illustrate this case.

o	If more than one of the links are something other than
	write-to-read links, then a full memory barrier is required
	between each successive pair of non-write-to-read links.  This
	case is illustrated by the Z6.0 litmus tests, both in the
	locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.