# Message Passing and Concurrency {#concurrency}

# Theory

One of the primary aims of SPDK is to scale linearly with the addition of
hardware. This can mean a number of things in practice. For instance, moving
from one SSD to two should double the number of I/Os per second. Or doubling
the number of CPU cores should double the amount of computation possible. Or
even doubling the number of NICs should double the network throughput. To
achieve this, the software must be designed such that threads of execution are
independent of one another as much as possible. In practice, that means
avoiding software locks and even atomic instructions.

Traditionally, software achieves concurrency by placing some shared data onto
the heap, protecting it with a lock, and then having all threads of execution
acquire the lock only when that shared data needs to be accessed; a minimal C
sketch of this pattern follows the list below. This model has a number of
great properties:

* It's relatively easy to convert single-threaded programs to multi-threaded
programs because you don't have to change the data model from the
single-threaded version. You just add a lock around the data.
* You can write your program as a synchronous, imperative list of statements
that you read from top to bottom.
* Your threads can be interrupted and put to sleep by the operating system
scheduler behind the scenes, allowing for efficient time-sharing of CPU
resources.

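For reference, here is a minimal sketch of that traditional model in C: the
shared data lives on the heap and a pthread mutex guards every access. The
names here are illustrative only, not taken from SPDK.

    #include <pthread.h>
    #include <stdlib.h>

    /* Shared data lives on the heap and is guarded by a single lock. */
    struct shared_counter {
        pthread_mutex_t lock;
        long value;
    };

    struct shared_counter *
    shared_counter_create(void)
    {
        struct shared_counter *c = calloc(1, sizeof(*c));

        if (c != NULL) {
            pthread_mutex_init(&c->lock, NULL);
        }
        return c;
    }

    /* Every thread that wants to touch the data must take the lock first. */
    void
    shared_counter_add(struct shared_counter *c, long delta)
    {
        pthread_mutex_lock(&c->lock);
        c->value += delta;
        pthread_mutex_unlock(&c->lock);
    }

As the next paragraph explains, it is exactly the `pthread_mutex_lock()` call
that becomes the bottleneck as more threads contend for the same data.
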
Unfortunately, as the number of threads scales up, contention on the lock
around the shared data does too. More granular locking helps, but then also
greatly increases the complexity of the program. Even then, beyond a certain
number of highly contended locks, threads will spend most of their time
attempting to acquire the locks and the program will not benefit from any
additional CPU cores.

SPDK takes a different approach altogether. Instead of placing shared data in a
global location that all threads access after acquiring a lock, SPDK will often
assign that data to a single thread. When other threads want to access the
data, they pass a message to the owning thread to perform the operation on
their behalf. This strategy, of course, is not at all new. For instance, it is
one of the core design principles of
[Erlang](http://erlang.org/download/armstrong_thesis_2003.pdf) and is the main
concurrency mechanism in [Go](https://tour.golang.org/concurrency/2). A message
in SPDK typically consists of a function pointer and a pointer to some context,
and is passed between threads using a
[lockless ring](http://dpdk.org/doc/guides/prog_guide/ring_lib.html). Message
passing is often much faster than most software developers' intuition leads
them to believe, primarily due to caching effects. If a single core is
consistently accessing the same data (on behalf of all of the other cores),
then that data is far more likely to be in a cache closer to that core. It's
often most efficient to have each core work on a relatively small set of data
sitting in its local cache and then hand off a small message to the next core
when done.
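
To make the shape of a message concrete, here is a small self-contained sketch
of the idea (SPDK itself passes messages over a DPDK ring rather than this toy
single-producer/single-consumer ring): a message is just a function pointer
plus a context pointer, other cores enqueue messages, and the owning core
periodically drains its ring and executes each one.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* A message is a function pointer plus a pointer to some context. */
    struct msg {
        void (*fn)(void *ctx);
        void *ctx;
    };

    /* Minimal lockless single-producer/single-consumer ring. */
    #define RING_SIZE 64

    struct msg_ring {
        struct msg slots[RING_SIZE];
        _Atomic size_t head;   /* written only by the producer */
        _Atomic size_t tail;   /* written only by the consumer */
    };

    static bool
    ring_enqueue(struct msg_ring *r, struct msg m)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SIZE) {
            return false;   /* ring is full */
        }
        r->slots[head % RING_SIZE] = m;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    static bool
    ring_dequeue(struct msg_ring *r, struct msg *m)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

        if (head == tail) {
            return false;   /* ring is empty */
        }
        *m = r->slots[tail % RING_SIZE];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

    /* The owning core calls this in its main loop to process messages. */
    static void
    drain_ring(struct msg_ring *r)
    {
        struct msg m;

        while (ring_dequeue(r, &m)) {
            m.fn(m.ctx);
        }
    }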

In more extreme cases where even message passing may be too costly, a copy of
the data will be made for each thread. The thread will then only reference its
local copy. To mutate the data, threads will send a message to each other
thread telling them to perform the update on their local copy. This strategy
works best when the data is mutated rarely but read very frequently, and it is
often employed in the I/O path. This of course trades memory size for
computational efficiency, so its use is limited to only the most critical code
paths.
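
As a sketch of that pattern (send_message_to_thread() here is a hypothetical
stand-in for the message passing primitive; a real application would enqueue
onto the target thread's lockless ring), each thread owns a private copy of a
read-mostly setting, and an update is broadcast as one message per thread:

    #include <stdint.h>

    #define NUM_THREADS 4

    /* Hypothetical stand-in: run fn(ctx) on the given thread. */
    void send_message_to_thread(int thread_idx, void (*fn)(void *ctx), void *ctx);

    /* One private copy of the read-mostly data per thread; reads in the I/O
     * path only ever touch the local copy, so they need no lock. */
    struct local_settings {
        uint32_t queue_depth;
    };

    static struct local_settings g_settings[NUM_THREADS];

    struct update_ctx {
        int thread_idx;
        uint32_t new_depth;
    };

    /* Runs on the owning thread, so the write needs no synchronization. */
    static void
    apply_update(void *ctx)
    {
        struct update_ctx *u = ctx;

        g_settings[u->thread_idx].queue_depth = u->new_depth;
    }

    /* Rare update path: ask every thread to refresh its own copy. The caller
     * must keep the contexts alive until all threads have processed them. */
    static void
    set_queue_depth_everywhere(struct update_ctx updates[NUM_THREADS], uint32_t depth)
    {
        int i;

        for (i = 0; i < NUM_THREADS; i++) {
            updates[i].thread_idx = i;
            updates[i].new_depth = depth;
            send_message_to_thread(i, apply_update, &updates[i]);
        }
    }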

# Message Passing Infrastructure

SPDK provides several layers of message passing infrastructure. The most
fundamental libraries in SPDK, for instance, don't do any message passing on
their own and instead enumerate rules about when functions may be called in
their documentation (e.g. @ref nvme). Most libraries, however, depend on SPDK's
[thread](http://www.spdk.io/doc/thread_8h.html)
abstraction, located in `libspdk_thread.a`. The thread abstraction provides a
basic message passing framework and defines a few key primitives.

First, spdk_thread is an abstraction for a thread of execution, and
spdk_poller is an abstraction for a function that should be called
periodically on a given thread. On each system thread that the user
wishes to use with SPDK, they must first call spdk_allocate_thread(). This
function takes three function pointers: one that will be called to pass a
message to this thread, one that will be called to request that a poller be
started on this thread, and finally one to request that a poller be stopped.
*The implementation of these functions is not provided by this library*. Many
applications already have facilities for passing messages, so to ease
integration with existing code bases we've left the implementation up to the
user. However, for users starting from scratch, see the following section on
the event framework for an SPDK-provided implementation.
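
As a rough sketch of what an application might supply, the three callbacks
could look like the following. The msg_ring and poller_list types and the
app_ring_enqueue(), app_poller_register(), and app_poller_unregister() helpers
are hypothetical application code, and the exact callback typedefs and the
spdk_allocate_thread() signature should be checked against spdk/thread.h for
your SPDK version.

    #include "spdk/thread.h"

    /* Hypothetical application helpers and types. */
    struct msg_ring;
    struct poller_list;
    void app_ring_enqueue(struct msg_ring *ring, spdk_thread_fn fn, void *ctx);
    struct spdk_poller *app_poller_register(struct poller_list *pollers,
                                            spdk_poller_fn fn, void *arg,
                                            uint64_t period_microseconds);
    void app_poller_unregister(struct poller_list *pollers,
                               struct spdk_poller *poller);

    /* Application-owned context for one system thread. */
    struct app_thread_ctx {
        struct msg_ring *ring;         /* e.g. a lockless ring of messages */
        struct poller_list *pollers;   /* application-managed poller list */
    };

    /* Called when another thread wants fn(ctx) to run on this thread. */
    static void
    app_pass_msg(spdk_thread_fn fn, void *ctx, void *thread_ctx)
    {
        struct app_thread_ctx *t = thread_ctx;

        app_ring_enqueue(t->ring, fn, ctx);
    }

    /* Called to register a function to run periodically on this thread. */
    static struct spdk_poller *
    app_start_poller(void *thread_ctx, spdk_poller_fn fn, void *arg,
                     uint64_t period_microseconds)
    {
        struct app_thread_ctx *t = thread_ctx;

        return app_poller_register(t->pollers, fn, arg, period_microseconds);
    }

    /* Called to stop a previously started poller. */
    static void
    app_stop_poller(struct spdk_poller *poller, void *thread_ctx)
    {
        struct app_thread_ctx *t = thread_ctx;

        app_poller_unregister(t->pollers, poller);
    }

    /* Called once on each system thread before any other SPDK call on it;
     * verify the exact signature against your spdk/thread.h. */
    static void
    app_thread_init(struct app_thread_ctx *t, const char *name)
    {
        spdk_allocate_thread(app_pass_msg, app_start_poller, app_stop_poller,
                             t, name);
    }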

The library also defines two other abstractions: spdk_io_device and
spdk_io_channel. In the course of implementing SPDK we noticed the
same pattern emerging in a number of different libraries. In order to
implement a message passing strategy, the code would describe some object with
global state and also some per-thread context associated with that object that
was accessed in the I/O path to avoid locking on the global state. The pattern
was clearest in the lowest layers where I/O was being submitted to block
devices. These devices often expose multiple queues that can be assigned to
threads and then accessed without a lock to submit I/O. To abstract that, we
generalized the device to spdk_io_device and the thread-specific queue to
spdk_io_channel. Over time, however, the pattern has appeared in a huge
number of places that don't fit quite so nicely with the names we originally
chose. In today's code, spdk_io_device is any pointer whose uniqueness is
predicated only on its memory address, and spdk_io_channel is the per-thread
context associated with a particular spdk_io_device.
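
For illustration, here is a sketch of how a library might use the pair. The
hw_queue type and the hw_queue_alloc(), hw_queue_free(), and hw_queue_submit()
helpers are hypothetical, and the exact SPDK signatures (for example whether
spdk_io_device_register() takes a name argument, and whether these functions
live in spdk/io_channel.h or spdk/thread.h) vary by SPDK version, so check the
headers you are building against.

    #include "spdk/io_channel.h"   /* spdk/thread.h in newer SPDK releases */

    /* Hypothetical per-thread hardware queue and helpers. */
    struct hw_queue;
    struct hw_queue *hw_queue_alloc(void);
    void hw_queue_free(struct hw_queue *q);
    void hw_queue_submit(struct hw_queue *q, void *io);

    /* Global, rarely touched state; its address serves as the io_device. */
    struct my_dev {
        int id;
    };

    /* Per-thread context: for example a hardware queue usable without locks. */
    struct my_dev_channel {
        struct hw_queue *queue;
    };

    /* Called the first time a given thread asks for a channel to this device. */
    static int
    my_dev_create_cb(void *io_device, void *ctx_buf)
    {
        struct my_dev_channel *ch = ctx_buf;

        ch->queue = hw_queue_alloc();
        return ch->queue != NULL ? 0 : -1;
    }

    /* Called when the last reference to that thread's channel is released. */
    static void
    my_dev_destroy_cb(void *io_device, void *ctx_buf)
    {
        struct my_dev_channel *ch = ctx_buf;

        hw_queue_free(ch->queue);
    }

    static void
    my_dev_start(struct my_dev *dev)
    {
        /* Any unique pointer works as the io_device handle. */
        spdk_io_device_register(dev, my_dev_create_cb, my_dev_destroy_cb,
                                sizeof(struct my_dev_channel));
    }

    /* On each thread, fetch the thread-local channel once and reuse it. */
    static struct spdk_io_channel *
    my_dev_get_channel(struct my_dev *dev)
    {
        return spdk_get_io_channel(dev);
    }

    /* I/O path: no locks, only the per-thread context is touched. */
    static void
    my_dev_submit(struct spdk_io_channel *ch, void *io)
    {
        struct my_dev_channel *dch = spdk_io_channel_get_ctx(ch);

        hw_queue_submit(dch->queue, io);
    }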

The threading abstraction provides functions to send a message to any other
thread, to send a message to all threads one by one, and to send a message to
all threads for which there is an io_channel for a given io_device.
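
A sketch of the first of those, handing an operation off to the thread that
owns the data (the struct and function names besides spdk_thread_send_msg()
are illustrative; the send and iterate entry points are declared in
spdk/thread.h and their signatures have changed somewhat across SPDK releases):

    #include <stdio.h>
    #include <stdlib.h>

    #include "spdk/thread.h"

    struct hello_ctx {
        int value;
    };

    /* Runs later, on the thread that owns the data. */
    static void
    say_hello(void *ctx)
    {
        struct hello_ctx *h = ctx;

        printf("running on the owning thread, value = %d\n", h->value);
        free(h);
    }

    /* Runs on any other thread: package the work up and send it over. */
    static void
    send_hello(struct spdk_thread *target)
    {
        struct hello_ctx *h = calloc(1, sizeof(*h));

        if (h == NULL) {
            return;
        }
        h->value = 42;
        spdk_thread_send_msg(target, say_hello, h);
    }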

# The event Framework

As the number of example applications in SPDK grew, it became clear that a
large portion of the code in each was implementing the basic message passing
infrastructure required to call spdk_allocate_thread(). This includes spawning
one thread per core, pinning each thread to a unique core, and allocating
lockless rings between the threads for message passing. Instead of
re-implementing that infrastructure for each example application, SPDK
provides the SPDK @ref event. This library handles setting up all of the
message passing infrastructure, installing signal handlers for clean shutdown,
implementing periodic pollers, and doing basic command line parsing. When
started through spdk_app_start(), the library automatically spawns all of
the threads requested, pins them, and calls spdk_allocate_thread() with
appropriate function pointers for each one. This makes it much easier to
implement a brand new SPDK application and is the recommended method for those
starting out. Only established applications with sufficient message passing
infrastructure should consider directly integrating the lower level libraries.
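
A minimal application using the event framework might look roughly like the
sketch below. The available opts fields and the number of context arguments
passed to the start callback have changed across SPDK releases, so treat this
as an outline and check spdk/event.h for the exact signatures.

    #include <stdio.h>

    #include "spdk/event.h"

    /* Called on the first core once all threads are spawned and pinned. */
    static void
    hello_start(void *arg1, void *arg2)
    {
        /* Application initialization goes here: register io_devices,
         * start pollers, send messages to other threads, etc. */
        printf("SPDK app started\n");

        /* Ask the framework to tear down and return from spdk_app_start(). */
        spdk_app_stop(0);
    }

    int
    main(int argc, char **argv)
    {
        struct spdk_app_opts opts;
        int rc;

        spdk_app_opts_init(&opts);
        opts.name = "hello_app";

        /* Spawns one pinned thread per core, wires up the message passing
         * rings, and then invokes hello_start(). */
        rc = spdk_app_start(&opts, hello_start, NULL, NULL);

        spdk_app_fini();
        return rc;
    }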

# Limitations of the C Language

Message passing is efficient, but it results in asynchronous code.
Unfortunately, asynchronous code is a challenge in C. It's often implemented by
passing function pointers that are called when an operation completes. This
chops up the code so that it isn't easy to follow, especially through logic
branches. The best solution is to use a language with support for
[futures and promises](https://en.wikipedia.org/wiki/Futures_and_promises),
such as C++, Rust, Go, or almost any other higher level language. However, SPDK
is a low level library and requires very wide compatibility and portability, so
we've elected to stay with plain old C.

We do have a few recommendations to share, though. For _simple_ callback
chains, it's easiest if you write the functions from bottom to top. By that we
mean that if function `foo` performs some asynchronous operation and calls
function `bar` when it completes, and `bar` in turn performs some operation
that calls function `baz` on completion, a good way to write it is as follows:

    void baz(void *ctx) {
        ...
    }

    void bar(void *ctx) {
        async_op(baz, ctx);
    }

    void foo(void *ctx) {
        async_op(bar, ctx);
    }

Don't split these functions up - keep them as a nice unit that can be read from
bottom to top.

For more complex callback chains, especially ones that have logical branches
or loops, it's best to write out a state machine. It turns out that higher
level languages that support futures and promises are just generating state
machines at compile time, so even though we don't have the ability to generate
them in C, we can still write them out by hand. As an example, here's a
callback chain that performs `foo` 5 times and then calls `bar` - effectively
an asynchronous for loop.

    enum states {
        FOO_START = 0,
        FOO_END,
        BAR_START,
        BAR_END
    };

    struct state_machine {
        enum states state;

        int count;
    };

    static void run_state_machine(struct state_machine *sm);

    static void
    foo_complete(void *ctx)
    {
        struct state_machine *sm = ctx;

        sm->state = FOO_END;
        run_state_machine(sm);
    }

    static void
    foo(struct state_machine *sm)
    {
        do_async_op(foo_complete, sm);
    }

    static void
    bar_complete(void *ctx)
    {
        struct state_machine *sm = ctx;

        sm->state = BAR_END;
        run_state_machine(sm);
    }

    static void
    bar(struct state_machine *sm)
    {
        do_async_op(bar_complete, sm);
    }

    static void
    run_state_machine(struct state_machine *sm)
    {
        enum states prev_state;

        do {
            prev_state = sm->state;

            switch (sm->state) {
            case FOO_START:
                foo(sm);
                break;
            case FOO_END:
                /* This is the loop condition: repeat foo until it
                 * has run 5 times. */
                if (++sm->count < 5) {
                    sm->state = FOO_START;
                } else {
                    sm->state = BAR_START;
                }
                break;
            case BAR_START:
                bar(sm);
                break;
            case BAR_END:
                /* The chain is finished; sm could be freed here. */
                break;
            }
        } while (prev_state != sm->state);
    }

    void do_async_for(void)
    {
        struct state_machine *sm;

        sm = malloc(sizeof(*sm));
        if (sm == NULL) {
            return;
        }
        sm->state = FOO_START;
        sm->count = 0;

        run_state_machine(sm);
    }

This is complex, of course, but the `run_state_machine` function can be read
from top to bottom to get a clear overview of what's happening in the code
without having to chase through each of the callbacks.