Commit | Line | Data |
---|---|---|
11fdf7f2 TL |
1 | # Block Device Layer Programming Guide {#bdev_pg} |
2 | ||
3 | ## Target Audience | |
4 | ||
5 | This programming guide is intended for developers authoring applications that | |
6 | use the SPDK bdev library to access block devices. | |
7 | ||
8 | ## Introduction | |
9 | ||
10 | A block device is a storage device that supports reading and writing data in | |
11 | fixed-size blocks. These blocks are usually 512 or 4096 bytes. The | |
12 | devices may be logical constructs in software or correspond to physical | |
13 | devices like NVMe SSDs. | |
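
Because I/O is expressed in whole blocks, callers often need to turn a byte offset and length into a block range and reject anything that is not block-aligned. The following is a minimal sketch of that arithmetic. `struct block_range` and `bytes_to_blocks()` are hypothetical names introduced for illustration and are not part of the bdev API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type, not part of SPDK: a byte range expressed in blocks. */
struct block_range {
	uint64_t first_block;
	uint64_t num_blocks;
};

/*
 * Convert a byte offset and length into a block range.
 * Returns false if block_size is zero, the length is zero, or either value
 * is not a multiple of the block size. *out is only written on success.
 */
bool
bytes_to_blocks(uint64_t offset, uint64_t nbytes, uint32_t block_size,
		struct block_range *out)
{
	if (block_size == 0 || nbytes == 0) {
		return false;
	}
	if (offset % block_size != 0 || nbytes % block_size != 0) {
		return false;
	}
	out->first_block = offset / block_size;
	out->num_blocks = nbytes / block_size;
	return true;
}
```

With 512-byte blocks, a 4096-byte request at byte offset 8192 covers 8 blocks starting at block 16. With 4096-byte blocks, the same request is a single block starting at block 2.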
14 | ||
15 | The block device layer consists of a single generic library in `lib/bdev`, | |
16 | plus a number of optional modules (as separate libraries) that implement | |
17 | various types of block devices. The public header file for the generic library | |
18 | is include/spdk/bdev.h, which is the entirety of the API needed to interact with any type |
19 | of block device. This guide will cover how to interact with bdevs using that | |
20 | API. For a guide to implementing a bdev module, see @ref bdev_module. | |
21 | ||
22 | The bdev layer provides a number of useful features in addition to providing a | |
23 | common abstraction for all block devices: | |
24 | ||
25 | - Automatic queueing of I/O requests in response to queue full or out-of-memory conditions | |
26 | - Hot remove support, even while I/O traffic is occurring |
27 | - I/O statistics such as bandwidth and latency | |
28 | - Device reset support and I/O timeout tracking | |
29 | ||
30 | ## Basic Primitives | |
31 | ||
32 | Users of the bdev API interact with a number of basic objects. | |
33 | ||
34 | struct spdk_bdev, which this guide will refer to as a *bdev*, represents a | |
35 | generic block device. struct spdk_bdev_desc, hereafter called a *descriptor*, |
36 | represents a handle to a given block device. Descriptors are used to establish | |
37 | and track permissions to use the underlying block device, much like a file | |
38 | descriptor on UNIX systems. Requests to the block device are asynchronous and | |
39 | represented by spdk_bdev_io objects. Requests must be submitted on an | |
40 | associated I/O channel. The motivation and design of I/O channels are described | |
41 | in @ref concurrency. | |
42 | ||
43 | Bdevs can be layered, such that some bdevs service I/O by routing requests to | |
44 | other bdevs. This can be used to implement caching, RAID, logical volume | |
45 | management, and more. Bdevs that route I/O to other bdevs are often referred | |
46 | to as virtual bdevs, or *vbdevs* for short. | |
47 | ||
48 | ## Initializing The Library | |
49 | ||
50 | The bdev layer depends on the generic message passing infrastructure | |
9f95a23c | 51 | abstracted by the header file include/spdk/thread.h. See @ref concurrency for a |
11fdf7f2 TL |
52 | full description. Most importantly, calls into the bdev library may only be |
53 | made from threads that have been allocated with SPDK by calling | |
54 | spdk_allocate_thread(). | |
55 | ||
56 | From an allocated thread, the bdev library may be initialized by calling | |
57 | spdk_bdev_initialize(), which is an asynchronous operation. Until the completion | |
58 | callback is called, no other bdev library functions may be invoked. Similarly, | |
59 | to tear down the bdev library, call spdk_bdev_finish(). | |
60 | ||
61 | ## Discovering Block Devices | |
62 | ||
63 | All block devices have a simple string name. At any time, a pointer to the | |
64 | device object can be obtained by calling spdk_bdev_get_by_name(), or the entire | |
65 | set of bdevs may be iterated using spdk_bdev_first() and spdk_bdev_next() and | |
66 | their variants. | |
67 | ||
68 | Some block devices may also be given aliases, which are also string names. | |
69 | Aliases behave like symlinks - they can be used interchangeably with the real | |
70 | name to look up the block device. | |
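
The symlink-like behavior of aliases can be illustrated with a small lookup routine. This is a toy model under stated assumptions, not the bdev library's implementation: `struct toy_named_bdev`, `TOY_MAX_ALIASES`, and `toy_get_by_name()` are hypothetical, and the device names used with it are arbitrary examples.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ALIASES 4

/* Hypothetical stand-in for a bdev; this is not struct spdk_bdev. */
struct toy_named_bdev {
	const char *name;
	const char *aliases[TOY_MAX_ALIASES];	/* unused slots are NULL */
};

/* Look up a device by its real name or by any alias. Returns NULL if none matches. */
struct toy_named_bdev *
toy_get_by_name(struct toy_named_bdev *devs, size_t count, const char *name)
{
	for (size_t i = 0; i < count; i++) {
		if (strcmp(devs[i].name, name) == 0) {
			return &devs[i];
		}
		for (size_t j = 0; j < TOY_MAX_ALIASES && devs[i].aliases[j] != NULL; j++) {
			if (strcmp(devs[i].aliases[j], name) == 0) {
				return &devs[i];
			}
		}
	}
	return NULL;
}
```

Looking up either the real name or an alias yields a pointer to the same device object, which is the property the text describes.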
71 | ||
72 | ## Preparing To Use A Block Device | |
73 | ||
74 | In order to send I/O requests to a block device, it must first be opened by | |
75 | calling spdk_bdev_open(). This will return a descriptor. Multiple users may have | |
76 | a bdev open at the same time, and coordination of reads and writes between | |
77 | users must be handled by some higher level mechanism outside of the bdev | |
78 | layer. Opening a bdev with write permission may fail if a virtual bdev module | |
79 | has *claimed* the bdev. Virtual bdev modules implement logic like RAID or | |
80 | logical volume management and forward their I/O to lower level bdevs, so they | |
81 | mark these lower level bdevs as claimed to prevent outside users from issuing | |
82 | writes. | |
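
The interaction between claims and write permission can be sketched with a toy model. Everything here is hypothetical (`struct toy_bdev`, `struct toy_desc`, `toy_bdev_open()`, `toy_bdev_close()`), and the choice of `-EPERM` as the error code is an assumption of the sketch rather than something the text specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the SPDK structures. */
struct toy_bdev {
	bool claimed;		/* set while a virtual bdev module owns the device */
	int open_descs;		/* number of descriptors currently open */
};

struct toy_desc {
	struct toy_bdev *bdev;
	bool write;
};

/* Open the device. Write access is refused while the device is claimed. */
int
toy_bdev_open(struct toy_bdev *bdev, bool write, struct toy_desc *desc)
{
	if (write && bdev->claimed) {
		return -EPERM;	/* assumed error code for this sketch */
	}
	desc->bdev = bdev;
	desc->write = write;
	bdev->open_descs++;
	return 0;
}

/* Release a descriptor obtained from toy_bdev_open(). */
void
toy_bdev_close(struct toy_desc *desc)
{
	desc->bdev->open_descs--;
	desc->bdev = NULL;
}
```

In this model, several users can hold descriptors at once, and a claim blocks only new writers. Read-only opens still succeed.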
83 | ||
84 | When a block device is opened, an optional callback and context can be | |
85 | provided that will be called if the underlying storage servicing the block | |
86 | device is removed. For example, the remove callback will be called on each | |
87 | open descriptor for a bdev backed by a physical NVMe SSD when the NVMe SSD is | |
88 | hot-unplugged. The callback can be thought of as a request to close the open | |
89 | descriptor so other memory may be freed. A bdev cannot be torn down while open | |
90 | descriptors exist, so it is highly recommended that a callback be provided. | |
91 | ||
92 | When a user is done with a descriptor, they may release it by calling | |
93 | spdk_bdev_close(). | |
94 | ||
95 | Descriptors may be passed to and used from multiple threads simultaneously. | |
96 | However, for each thread a separate I/O channel must be obtained by calling | |
97 | spdk_bdev_get_io_channel(). This will allocate the necessary per-thread | |
98 | resources to submit I/O requests to the bdev without taking locks. To release | |
99 | a channel, call spdk_put_io_channel(). A descriptor cannot be closed until | |
100 | all associated channels have been destroyed. | |
101 | ||
102 | ## Sending I/O | |
103 | ||
104 | Once a descriptor and a channel have been obtained, I/O may be sent by calling | |
105 | the various I/O submission functions such as spdk_bdev_read(). These calls each | |
106 | take a callback as an argument which will be called some time later with a | |
107 | handle to an spdk_bdev_io object. In response to that completion, the user | |
108 | must call spdk_bdev_free_io() to release the resources. Within this callback, | |
109 | the user may also use the functions spdk_bdev_io_get_nvme_status() and | |
110 | spdk_bdev_io_get_scsi_status() to obtain error information in the format of | |
111 | their choosing. | |
112 | ||
113 | The submission functions, such as spdk_bdev_read() and spdk_bdev_write(), take | |
114 | as an argument either a pointer to a region of memory or a scatter gather list | |
115 | describing the memory that will be transferred to or from the block device. | |
116 | This memory must be allocated through spdk_dma_malloc() or | |
117 | its variants. For a full explanation of why the memory must come from a | |
118 | special allocation pool, see @ref memory. Where possible, data in memory will | |
119 | be *directly transferred to the block device* using | |
120 | [Direct Memory Access](https://en.wikipedia.org/wiki/Direct_memory_access). | |
121 | That means it is not copied. | |
122 | ||
123 | All I/O submission functions are asynchronous and non-blocking. They will not | |
124 | block or stall the thread for any reason. However, the I/O submission | |
125 | functions may fail in one of two ways. First, they may fail immediately and | |
126 | return an error code. In that case, the provided callback will not be called. | |
127 | Second, they may fail asynchronously. In that case, the associated | |
128 | spdk_bdev_io will be passed to the callback and it will report error | |
129 | information. | |
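
The two failure modes can be made concrete with a toy model of an asynchronous submission path. None of the names below belong to SPDK: `toy_submit_read()`, `toy_poll()`, `toy_record_result()`, and the structures are hypothetical, as are the specific error codes and the fixed queue depth.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_QUEUE_DEPTH 2

/* Hypothetical completion callback; success is false on asynchronous failure. */
typedef void (*toy_io_cb)(bool success, void *cb_arg);

struct toy_req {
	uint64_t block;
	toy_io_cb cb;
	void *cb_arg;
};

struct toy_io_dev {
	uint64_t num_blocks;
	uint64_t bad_block;	/* block that fails at completion time */
	struct toy_req pending[TOY_QUEUE_DEPTH];
	size_t num_pending;
};

/*
 * Submit a one-block read without blocking.
 * A negative return is an immediate failure: the callback will never run.
 * A zero return means the callback will run later, from toy_poll().
 */
int
toy_submit_read(struct toy_io_dev *dev, uint64_t block, toy_io_cb cb, void *cb_arg)
{
	if (block >= dev->num_blocks) {
		return -EINVAL;
	}
	if (dev->num_pending == TOY_QUEUE_DEPTH) {
		return -ENOMEM;
	}
	dev->pending[dev->num_pending].block = block;
	dev->pending[dev->num_pending].cb = cb;
	dev->pending[dev->num_pending].cb_arg = cb_arg;
	dev->num_pending++;
	return 0;
}

/* Complete every pending request, reporting success or failure per request. */
void
toy_poll(struct toy_io_dev *dev)
{
	struct toy_req done[TOY_QUEUE_DEPTH];
	size_t n = dev->num_pending;

	/* Copy first so a callback may safely submit new requests. */
	memcpy(done, dev->pending, n * sizeof(done[0]));
	dev->num_pending = 0;
	for (size_t i = 0; i < n; i++) {
		done[i].cb(done[i].block != dev->bad_block, done[i].cb_arg);
	}
}

/* Example callback: record the outcome in an int (1 = ok, -1 = failed). */
void
toy_record_result(bool success, void *cb_arg)
{
	*(int *)cb_arg = success ? 1 : -1;
}
```

A caller checks the return value first. Only when it is zero should the caller expect the callback to run and look there for the final status.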
130 | ||
131 | Some I/O request types are optional and may not be supported by a given bdev. | |
132 | To query a bdev for the I/O request types it supports, call | |
133 | spdk_bdev_io_type_supported(). | |
134 | ||
135 | ## Resetting A Block Device | |
136 | ||
137 | In order to handle unexpected failure conditions, the bdev library provides a | |
138 | mechanism to perform a device reset by calling spdk_bdev_reset(). This will pass | |
139 | a message to every other thread that holds an I/O channel for the bdev, |
140 | pause each of those channels, then forward a reset request to the underlying | |
141 | bdev module and wait for completion. Upon completion, the I/O channels will | |
142 | resume and the reset will complete. The behavior inside the bdev module is | |
143 | module-specific. For example, NVMe devices will delete all queue pairs, | |
144 | perform an NVMe reset, then recreate the queue pairs and continue. Most | |
145 | importantly, regardless of device type, *all I/O outstanding to the block | |
146 | device will be completed prior to the reset completing.* |
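
The ordering of a reset can be sketched as a single-threaded simulation. This is only a model of the sequence described above. The real library passes messages between threads, and all names here (`struct toy_channel`, `struct toy_reset_dev`, `toy_channel_submit()`, `toy_reset()`) are hypothetical. How a paused channel treats new submissions is not specified in the text, so the sketch simply refuses them as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHANNELS 4

/* Hypothetical per-thread channel state; not an SPDK structure. */
struct toy_channel {
	bool paused;
	int outstanding;	/* I/O submitted but not yet completed */
};

/* Hypothetical device with one channel per thread. */
struct toy_reset_dev {
	struct toy_channel ch[TOY_MAX_CHANNELS];
	size_t num_channels;
	int drained;		/* I/O completed as part of a reset */
	int resets;		/* number of resets that have finished */
};

/* Accept an I/O on a channel unless it is paused (assumed behavior). */
bool
toy_channel_submit(struct toy_channel *ch)
{
	if (ch->paused) {
		return false;
	}
	ch->outstanding++;
	return true;
}

/* Run the reset sequence: pause, complete outstanding I/O, resume, finish. */
void
toy_reset(struct toy_reset_dev *dev)
{
	size_t i;

	for (i = 0; i < dev->num_channels; i++) {
		dev->ch[i].paused = true;
	}
	/* Stand-in for the module-specific reset: complete everything in flight. */
	for (i = 0; i < dev->num_channels; i++) {
		dev->drained += dev->ch[i].outstanding;
		dev->ch[i].outstanding = 0;
	}
	for (i = 0; i < dev->num_channels; i++) {
		dev->ch[i].paused = false;
	}
	/* The reset is reported complete only after no I/O remains outstanding. */
	dev->resets++;
}
```

The invariant to take away is that the `resets` counter is incremented only after every channel's `outstanding` count has reached zero.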