# User Space Drivers {#userspace}

# Controlling Hardware From User Space {#userspace_control}

Much of the documentation for SPDK talks about _user space drivers_, so it's
important to understand what that means at a technical level. First and
foremost, a _driver_ is software that directly controls a particular device
attached to a computer. Second, operating systems segregate the system's
virtual memory into two categories of addresses based on privilege level -
[kernel space and user space](https://en.wikipedia.org/wiki/User_space). This
separation is aided by features on the CPU itself that enforce memory
separation called
[protection rings](https://en.wikipedia.org/wiki/Protection_ring). Typically,
drivers run in kernel space (i.e. ring 0 on x86). SPDK contains drivers that
instead are designed to run in user space, but they still interface directly
with the hardware device that they are controlling.

In order for SPDK to take control of a device, it must first instruct the
operating system to relinquish control. This is often referred to as unbinding
the kernel driver from the device, and on Linux it is done by
[writing to a file in sysfs](https://lwn.net/Articles/143397/).
SPDK then rebinds the device to one of two special device drivers that come
bundled with Linux -
[uio](https://www.kernel.org/doc/html/latest/driver-api/uio-howto.html) or
[vfio](https://www.kernel.org/doc/Documentation/vfio.txt). These two drivers
are "dummy" drivers in the sense that they mostly indicate to the operating
system that the device has a driver bound to it so it won't automatically try
to re-bind the default driver. They don't actually initialize the hardware in
any way, nor do they even understand what type of device it is. The primary
difference between uio and vfio is that vfio is capable of programming the
platform's
[IOMMU](https://en.wikipedia.org/wiki/Input%E2%80%93output_memory_management_unit),
which is a critical piece of hardware for ensuring memory safety in user space
drivers. See @ref memory for full details.
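
As a concrete illustration of the sysfs mechanism, here is a minimal C sketch
that unbinds a hypothetical NVMe device at PCI address 0000:01:00.0 from its
current kernel driver and hands it to vfio-pci via the driver_override
attribute. The PCI address and the choice of vfio-pci are assumptions made for
this example; SPDK ships helper scripts that automate these steps in practice.

```c
/* Sketch of the sysfs unbind/rebind mechanism. The PCI address and target
 * driver are placeholders, error handling is minimal, and root privileges
 * are required. */
#include <stdio.h>

static int
write_sysfs(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");

	if (f == NULL) {
		perror(path);
		return -1;
	}
	fputs(value, f);
	fclose(f);
	return 0;
}

int
main(void)
{
	const char *bdf = "0000:01:00.0";
	char path[256];

	/* Ask the current driver (e.g. the kernel nvme driver) to let go. */
	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/driver/unbind", bdf);
	write_sysfs(path, bdf);

	/* Request that vfio-pci, rather than the default driver, claim the device. */
	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/driver_override", bdf);
	write_sysfs(path, "vfio-pci");

	/* Trigger a re-probe so the override takes effect. */
	write_sysfs("/sys/bus/pci/drivers_probe", bdf);

	return 0;
}
```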

Once the device is unbound from the operating system kernel, the operating
system can't use it anymore. For example, if you unbind an NVMe device on Linux,
the device nodes corresponding to it, such as /dev/nvme0n1, will disappear. It
further means that filesystems mounted on the device will also be removed and
kernel filesystems can no longer interact with the device. In fact, the entire
kernel block storage stack is no longer involved. Instead, SPDK provides
re-imagined implementations of most of the layers in a typical operating system
storage stack all as C libraries that can be directly embedded into your
application. This includes a [block device abstraction layer](@ref bdev)
primarily, but also [block allocators](@ref blob) and
[filesystem-like components](@ref blobfs).

User space drivers utilize features in uio or vfio to map the
[PCI BAR](https://en.wikipedia.org/wiki/PCI_configuration_space) for the device
into the current process, which allows the driver to perform
[MMIO](https://en.wikipedia.org/wiki/Memory-mapped_I/O) directly. The SPDK @ref
nvme, for instance, maps the BAR for the NVMe device and then follows along
with the
[NVMe Specification](http://nvmexpress.org/wp-content/uploads/NVM_Express_Revision_1.3.pdf)
to initialize the device, create queue pairs, and ultimately send I/O.
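
To make that flow concrete, here is a hedged sketch of attaching to a local
NVMe controller with the SPDK NVMe driver: initialize the environment, probe
the PCIe bus, and record the controller in the attach callback. Names like
g_ctrlr are local to this example; consult include/spdk/nvme.h in your SPDK
release for the authoritative signatures.

```c
#include <stdio.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *g_ctrlr;

/* Called for each NVMe controller found; returning true asks SPDK to attach. */
static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	return true;
}

/* Called after the user space driver has mapped the BAR and initialized the
 * controller per the NVMe specification. */
static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	g_ctrlr = ctrlr;
}

int
main(void)
{
	struct spdk_env_opts opts;

	spdk_env_opts_init(&opts);
	opts.name = "userspace_example";
	if (spdk_env_init(&opts) != 0) {
		return 1;
	}

	/* Scan the local PCIe bus for NVMe controllers bound to uio/vfio. */
	if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0) {
		return 1;
	}
	if (g_ctrlr == NULL) {
		fprintf(stderr, "no NVMe controller found\n");
		return 1;
	}

	/* ... allocate I/O queue pairs and submit I/O here ... */

	spdk_nvme_detach(g_ctrlr);
	return 0;
}
```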

# Interrupts {#userspace_interrupts}

SPDK polls devices for completions instead of waiting for interrupts. There
are a number of reasons for doing this: 1) practically speaking, routing an
interrupt to a handler in a user space process just isn't feasible for most
hardware designs, and 2) interrupts introduce software jitter and have
significant overhead due to forced context switches. Operations in SPDK are
almost universally asynchronous and allow the user to provide a callback on
completion. The callback is called in response to the user calling a function
to poll for completions. Polling an NVMe device is fast because only host
memory needs to be read (no MMIO) to check a queue pair for a bit flip, and
technologies such as Intel's
[DDIO](https://www.intel.com/content/www/us/en/io/data-direct-i-o-technology.html)
will ensure that the host memory being checked is present in the CPU cache
after an update by the device.
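
A rough sketch of that submit-then-poll pattern follows, assuming a controller
and namespace have already been attached as in the previous example. The
completion callback runs inside the application's own call to poll for
completions, not in interrupt context; helper names such as read_complete are
illustrative and error handling is omitted.

```c
#include <stdbool.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

static bool g_done;

/* Invoked from within spdk_nvme_qpair_process_completions(), not from an
 * interrupt handler. */
static void
read_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
	g_done = true;
}

static void
read_first_block(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
	struct spdk_nvme_qpair *qpair;
	void *buf;

	qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
	buf = spdk_dma_zmalloc(spdk_nvme_ns_get_sector_size(ns), 0x1000, NULL);

	/* Asynchronous submission: this returns as soon as the command has been
	 * placed in the hardware submission queue. */
	spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* LBA */, 1 /* count */,
			      read_complete, NULL, 0);

	/* Poll host memory for the completion instead of sleeping on an
	 * interrupt. */
	while (!g_done) {
		spdk_nvme_qpair_process_completions(qpair, 0);
	}

	spdk_dma_free(buf);
	spdk_nvme_ctrlr_free_io_qpair(qpair);
}
```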

# Threading {#userspace_threading}

NVMe devices expose multiple queues for submitting requests to the hardware.
Separate queues can be accessed without coordination, so software can send
requests to the device from multiple threads of execution in parallel without
locks. Unfortunately, kernel drivers must be designed to handle I/O coming
from lots of different places, either in the operating system or in various
processes on the system, and the thread topology of those processes changes
over time. Most kernel drivers elect to map hardware queues to cores (as close
to 1:1 as possible), and then when a request is submitted they look up the
correct hardware queue for whatever core the current thread happens to be
running on. Often, they'll need to either acquire a lock around the queue or
temporarily disable interrupts to guard against preemption from threads
running on the same core, which can be expensive. This is a large improvement
over older hardware interfaces that only had a single queue or no queue at
all, but still isn't always optimal.

A user space driver, on the other hand, is embedded into a single application.
This application knows exactly how many threads (or processes) exist
because the application created them. Therefore, the SPDK drivers choose to
expose the hardware queues directly to the application with the requirement
that a hardware queue is only ever accessed from one thread at a time. In
practice, applications assign one hardware queue to each thread (as opposed to
one hardware queue per core in kernel drivers). This guarantees that the thread
can submit requests without having to perform any sort of coordination (i.e.
locking) with the other threads in the system.
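
The sketch below illustrates that convention under the assumption that a
controller and namespace have already been attached: each thread allocates its
own queue pair and uses it exclusively, so no locks appear on the I/O path.
The thread count and the io_thread helper are illustrative only.

```c
#include <pthread.h>

#include "spdk/nvme.h"

struct thread_ctx {
	struct spdk_nvme_ctrlr *ctrlr;
	struct spdk_nvme_ns *ns;
};

/* Each thread owns one I/O queue pair; no other thread ever touches it, so
 * submission and completion processing need no locks. Allocation itself goes
 * through the controller, which is shared. */
static void *
io_thread(void *arg)
{
	struct thread_ctx *ctx = arg;
	struct spdk_nvme_qpair *qpair;

	qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctx->ctrlr, NULL, 0);

	/* Submit I/O to ctx->ns and poll for completions on this queue pair
	 * only, e.g. with spdk_nvme_ns_cmd_read() and
	 * spdk_nvme_qpair_process_completions(). */

	spdk_nvme_ctrlr_free_io_qpair(qpair);
	return NULL;
}

static void
start_io_threads(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
	struct thread_ctx ctx = { .ctrlr = ctrlr, .ns = ns };
	pthread_t threads[4];

	for (int i = 0; i < 4; i++) {
		pthread_create(&threads[i], NULL, io_thread, &ctx);
	}
	for (int i = 0; i < 4; i++) {
		pthread_join(threads[i], NULL);
	}
}
```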