# NVMe Driver {#nvme}

# In this document {#nvme_toc}

* @ref nvme_intro
* @ref nvme_examples
* @ref nvme_interface
* @ref nvme_design
* @ref nvme_fabrics_host
* @ref nvme_multi_process
* @ref nvme_hotplug

# Introduction {#nvme_intro}

The NVMe driver is a C library that may be linked directly into an application;
it provides direct, zero-copy data transfer to and from
[NVMe SSDs](http://nvmexpress.org/). It is entirely passive, meaning that it spawns
no threads and only performs actions in response to function calls from the
application itself. The library controls NVMe devices by directly mapping the
[PCI BAR](https://en.wikipedia.org/wiki/PCI_configuration_space) into the local
process and performing [MMIO](https://en.wikipedia.org/wiki/Memory-mapped_I/O).
I/O is submitted asynchronously via queue pairs, and the general flow is not
entirely dissimilar to Linux's
[libaio](http://man7.org/linux/man-pages/man2/io_submit.2.html).

More recently, the library has been improved to also connect to remote NVMe
devices via NVMe over Fabrics. Users may now call spdk_nvme_probe() on both
local PCI buses and on remote NVMe over Fabrics discovery services. The API is
otherwise unchanged.

# Examples {#nvme_examples}

## Getting Started with Hello World {#nvme_helloworld}

There are a number of examples provided that demonstrate how to use the NVMe
library. They are all in the [examples/nvme](https://github.com/spdk/spdk/tree/master/examples/nvme)
directory in the repository. The best place to start is
[hello_world](https://github.com/spdk/spdk/blob/master/examples/nvme/hello_world/hello_world.c).

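
In outline, hello_world follows the probe/attach pattern sketched below. This is a
much-reduced sketch, not the actual example: error handling, namespace enumeration,
and cleanup are omitted, and the callback and application names are illustrative.

~~~{.c}
#include <stdio.h>
#include <stdbool.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

/* Called once per discovered controller; return true to attach to it. */
static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	return true;
}

/* Called after a controller has been attached and initialized. */
static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	printf("Attached to %s\n", trid->traddr);
	/* A real application would save ctrlr and enumerate its namespaces here. */
}

int
main(void)
{
	struct spdk_env_opts opts;

	spdk_env_opts_init(&opts);
	opts.name = "hello_sketch";	/* arbitrary application name */
	if (spdk_env_init(&opts) < 0) {
		return 1;
	}

	/* Enumerate all local PCIe-attached NVMe controllers. */
	if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0) {
		return 1;
	}

	return 0;
}
~~~
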
## Running Benchmarks with Fio Plugin {#nvme_fioplugin}

SPDK provides a plugin to the very popular [fio](https://github.com/axboe/fio)
tool for running some basic benchmarks. See the fio plugin
[guide](https://github.com/spdk/spdk/blob/master/examples/nvme/fio_plugin/)
for more details.

## Running Benchmarks with Perf Tool {#nvme_perf}

The NVMe perf utility in [examples/nvme/perf](https://github.com/spdk/spdk/tree/master/examples/nvme/perf)
is another example that can be used for performance testing. The fio
tool is widely used because it is very flexible. However, that flexibility adds
overhead and reduces the efficiency of SPDK. Therefore, SPDK provides the perf
benchmarking tool, which has minimal overhead during benchmarking. We have
measured up to 2.6 times more IOPS/core when using perf vs. fio with the
4K 100% Random Read workload. The perf benchmarking tool provides several
run time options to support the most common workloads. The following examples
demonstrate how to use perf.

Example: Using perf for a 4K 100% Random Read workload to a local NVMe SSD for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:PCIe traddr:0000:04:00.0' -t 300
~~~

Example: Using perf for a 4K 100% Random Read workload to a remote NVMe SSD exported over the network via NVMe-oF
~~~{.sh}
perf -q 128 -o 4096 -w randread -r 'trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420' -t 300
~~~

Example: Using perf for a 4K 70/30 Random Read/Write mix workload to all local NVMe SSDs for 300 seconds
~~~{.sh}
perf -q 128 -o 4096 -w randrw -M 70 -t 300
~~~

Example: Using perf for an extended LBA format CRC guard test to a local NVMe SSD;
the user must write to the SSD before reading the LBAs back
~~~{.sh}
perf -q 1 -o 4096 -w write -r 'trtype:PCIe traddr:0000:04:00.0' -t 300 -e 'PRACT=0,PRCKH=GUARD'
perf -q 1 -o 4096 -w read -r 'trtype:PCIe traddr:0000:04:00.0' -t 200 -e 'PRACT=0,PRCKH=GUARD'
~~~

# Public Interface {#nvme_interface}

- spdk/nvme.h

Key Functions | Description
------------------------------------------- | -----------
spdk_nvme_probe() | @copybrief spdk_nvme_probe()
spdk_nvme_ctrlr_alloc_io_qpair() | @copybrief spdk_nvme_ctrlr_alloc_io_qpair()
spdk_nvme_ctrlr_get_ns() | @copybrief spdk_nvme_ctrlr_get_ns()
spdk_nvme_ns_cmd_read() | @copybrief spdk_nvme_ns_cmd_read()
spdk_nvme_ns_cmd_readv() | @copybrief spdk_nvme_ns_cmd_readv()
spdk_nvme_ns_cmd_read_with_md() | @copybrief spdk_nvme_ns_cmd_read_with_md()
spdk_nvme_ns_cmd_write() | @copybrief spdk_nvme_ns_cmd_write()
spdk_nvme_ns_cmd_writev() | @copybrief spdk_nvme_ns_cmd_writev()
spdk_nvme_ns_cmd_write_with_md() | @copybrief spdk_nvme_ns_cmd_write_with_md()
spdk_nvme_ns_cmd_write_zeroes() | @copybrief spdk_nvme_ns_cmd_write_zeroes()
spdk_nvme_ns_cmd_dataset_management() | @copybrief spdk_nvme_ns_cmd_dataset_management()
spdk_nvme_ns_cmd_flush() | @copybrief spdk_nvme_ns_cmd_flush()
spdk_nvme_qpair_process_completions() | @copybrief spdk_nvme_qpair_process_completions()
spdk_nvme_ctrlr_cmd_admin_raw() | @copybrief spdk_nvme_ctrlr_cmd_admin_raw()
spdk_nvme_ctrlr_process_admin_completions() | @copybrief spdk_nvme_ctrlr_process_admin_completions()
spdk_nvme_ctrlr_cmd_io_raw() | @copybrief spdk_nvme_ctrlr_cmd_io_raw()
spdk_nvme_ctrlr_cmd_io_raw_with_md() | @copybrief spdk_nvme_ctrlr_cmd_io_raw_with_md()

# NVMe Driver Design {#nvme_design}

## NVMe I/O Submission {#nvme_io_submission}

I/O is submitted to an NVMe namespace using the spdk_nvme_ns_cmd_xxx functions. The NVMe
driver submits the I/O request as an NVMe submission queue entry on the queue
pair specified in the command. The function returns immediately, prior to the
completion of the command. The application must poll for I/O completion on each
queue pair with outstanding I/O to receive completion callbacks by calling
spdk_nvme_qpair_process_completions().

@sa spdk_nvme_ns_cmd_read, spdk_nvme_ns_cmd_write, spdk_nvme_ns_cmd_dataset_management,
spdk_nvme_ns_cmd_flush, spdk_nvme_qpair_process_completions
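
The submit-then-poll pattern looks roughly like the sketch below, which reads a single
block and waits for its completion. It assumes `ctrlr` and `ns` were obtained during
attach (for example via spdk_nvme_ctrlr_get_ns()) and trims error handling for brevity.

~~~{.c}
#include <stdio.h>
#include <stdbool.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

/* Completion callback: invoked from spdk_nvme_qpair_process_completions(). */
static void
read_complete(void *cb_arg, const struct spdk_nvme_cpl *cpl)
{
	bool *done = cb_arg;

	if (spdk_nvme_cpl_is_error(cpl)) {
		fprintf(stderr, "read failed\n");
	}
	*done = true;
}

static void
read_first_block(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
	struct spdk_nvme_qpair *qpair;
	void *buf;
	bool done = false;

	qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
	buf = spdk_zmalloc(spdk_nvme_ns_get_sector_size(ns), 0x1000, NULL,
			   SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);

	/* The call returns as soon as the command is placed on the submission queue. */
	spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* LBA */, 1 /* LBA count */,
			      read_complete, &done, 0);

	/* Poll the queue pair until the completion callback has fired. */
	while (!done) {
		spdk_nvme_qpair_process_completions(qpair, 0);
	}

	spdk_free(buf);
	spdk_nvme_ctrlr_free_io_qpair(qpair);
}
~~~
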

### Scaling Performance {#nvme_scaling}

NVMe queue pairs (struct spdk_nvme_qpair) provide parallel submission paths for
I/O. I/O may be submitted on multiple queue pairs simultaneously from different
threads. Queue pairs contain no locks or atomics, however, so a given queue
pair may only be used by a single thread at a time. This requirement is not
enforced by the NVMe driver (doing so would require a lock), and violating this
requirement results in undefined behavior.

The number of queue pairs allowed is dictated by the NVMe SSD itself. The
specification allows for thousands, but most devices support between 32
and 128. The specification makes no guarantees about the performance available from
each queue pair, but in practice the full performance of a device is almost
always achievable using just one queue pair. For example, if a device claims to
be capable of 450,000 I/O per second at queue depth 128, in practice it does
not matter if the driver is using 4 queue pairs each with queue depth 32, or a
single queue pair with queue depth 128.

Given the above, the easiest threading model for an application using SPDK is
to spawn a fixed number of threads in a pool and dedicate a single NVMe queue
pair to each thread. A further improvement would be to pin each thread to a
separate CPU core, and often the SPDK documentation will use "CPU core" and
"thread" interchangeably because we have this threading model in mind.

The NVMe driver takes no locks in the I/O path, so it scales linearly in terms
of performance per thread as long as a queue pair and a CPU core are dedicated
to each new thread. In order to take full advantage of this scaling,
applications should consider organizing their internal data structures such
that data is assigned exclusively to a single thread. All operations that
require that data should be done by sending a request to the owning thread.
This results in a message passing architecture, as opposed to a locking
architecture, and will result in superior scaling across CPU cores.
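
A minimal sketch of the queue-pair-per-thread model follows. It uses plain pthreads for
brevity; the worker body and per-core pinning (for example via pthread_setaffinity_np()
or the SPDK event framework) are reduced to placeholders, and NUM_WORKERS is an
arbitrary illustrative value.

~~~{.c}
#include <pthread.h>
#include "spdk/nvme.h"

#define NUM_WORKERS 4	/* illustrative; normally one worker per dedicated core */

struct worker_ctx {
	struct spdk_nvme_ns *ns;
	struct spdk_nvme_qpair *qpair;	/* used exclusively by this worker's thread */
	pthread_t thread;
};

static void *
worker_fn(void *arg)
{
	struct worker_ctx *w = arg;

	/* A real worker would submit I/O with spdk_nvme_ns_cmd_read()/write()
	 * on w->qpair and poll spdk_nvme_qpair_process_completions() here. */
	spdk_nvme_qpair_process_completions(w->qpair, 0);
	return NULL;
}

static void
start_workers(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
	static struct worker_ctx workers[NUM_WORKERS];

	for (int i = 0; i < NUM_WORKERS; i++) {
		workers[i].ns = ns;
		/* Each thread owns its queue pair; it is never shared or locked. */
		workers[i].qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
		pthread_create(&workers[i].thread, NULL, worker_fn, &workers[i]);
	}
	for (int i = 0; i < NUM_WORKERS; i++) {
		pthread_join(workers[i].thread, NULL);
	}
}
~~~
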

## NVMe Driver Internal Memory Usage {#nvme_memory_usage}

The SPDK NVMe driver provides a zero-copy data transfer path, which means that
there are no data buffers for I/O commands. However, some Admin commands have
data copies depending on the API used by the user.

Each queue pair has a number of trackers used to track commands submitted by the
caller. The number of trackers for I/O queues depends on the queue size requested
by the user and on the Maximum Queue Entries Supported (MQES, a 0-based value)
field read from the controller capabilities register. Each tracker has a fixed
size of 4096 bytes, so the maximum memory used for trackers on each I/O queue
is (MQES + 1) * 4 KiB.

I/O queue pairs are usually allocated in host memory, and this is the case for most
NVMe controllers. Some NVMe controllers support a Controller Memory Buffer, which
allows I/O queues to be placed in the controller's PCI BAR space; the SPDK NVMe
driver can place the I/O submission queue in the controller memory buffer, depending
on the user's input and the controller's capabilities. Each submission queue entry
(SQE) and completion queue entry (CQE) consumes 64 bytes and 16 bytes respectively.
Therefore, the maximum memory used for each I/O queue pair is (MQES + 1) * (64 + 16) bytes.
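
As a rough illustration of these sizing rules, the sketch below reads MQES from the
controller capabilities register and prints the per-queue maxima; `ctrlr` is assumed
to be an already attached controller.

~~~{.c}
#include <stdio.h>
#include "spdk/nvme.h"

static void
print_io_queue_memory(struct spdk_nvme_ctrlr *ctrlr)
{
	union spdk_nvme_cap_register cap = spdk_nvme_ctrlr_get_regs_cap(ctrlr);
	uint32_t entries = cap.bits.mqes + 1;	/* MQES is 0-based */

	/* One 4 KiB tracker per entry, plus a 64-byte SQE and 16-byte CQE per entry. */
	printf("tracker memory: %u KiB\n", entries * 4);
	printf("SQ + CQ memory: %u bytes\n", entries * (64 + 16));
}
~~~
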

# NVMe over Fabrics Host Support {#nvme_fabrics_host}

The NVMe driver supports connecting to remote NVMe-oF targets and
interacting with them in the same manner as local NVMe SSDs.

## Specifying Remote NVMe over Fabrics Targets {#nvme_fabrics_trid}

The method for connecting to a remote NVMe-oF target is very similar
to the normal enumeration process for local PCIe-attached NVMe devices.
To connect to a remote NVMe over Fabrics subsystem, the user may call
spdk_nvme_probe() with the `trid` parameter specifying the address of
the NVMe-oF target.

The caller may fill out the spdk_nvme_transport_id structure manually
or use the spdk_nvme_transport_id_parse() function to convert a
human-readable string representation into the required structure.

The spdk_nvme_transport_id may contain the address of a discovery service
or a single NVM subsystem. If a discovery service address is specified,
the NVMe library will call the spdk_nvme_probe() `probe_cb` for each
discovered NVM subsystem, which allows the user to select the desired
subsystems to be attached. Alternatively, if the address specifies a
single NVM subsystem directly, the NVMe library will call `probe_cb`
for just that subsystem; this allows the user to skip the discovery step
and connect directly to a subsystem with a known address.
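
For example, a discovery-based connection might look like the sketch below. The
transport address reuses the illustrative 192.168.100.8:4420 endpoint from the perf
example above, the subsystem NQN is the standard discovery NQN, and the probe/attach
callbacks are placeholders like those shown earlier.

~~~{.c}
#include <string.h>
#include <stdbool.h>
#include "spdk/nvme.h"

static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	/* Called once per discovered NVM subsystem; return true to attach. */
	return true;
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	/* Remote controllers are used exactly like local ones from here on. */
}

static int
probe_remote_target(void)
{
	struct spdk_nvme_transport_id trid;

	memset(&trid, 0, sizeof(trid));
	if (spdk_nvme_transport_id_parse(&trid,
			"trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420 "
			"subnqn:nqn.2014-08.org.nvmexpress.discovery") != 0) {
		return -1;
	}

	return spdk_nvme_probe(&trid, NULL, probe_cb, attach_cb, NULL);
}
~~~
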

## RDMA Limitations

Please refer to the NVMe-oF target's @ref nvmf_rdma_limitations

# NVMe Multi Process {#nvme_multi_process}

This capability enables the SPDK NVMe driver to support multiple processes accessing the
same NVMe device. The NVMe driver allocates critical structures from shared memory, so
that each process can map that memory and create its own queue pairs or share the admin
queue. There is a limited number of I/O queue pairs per NVMe controller.

The primary motivation for this feature is to support management tools that can attach
to long running applications, perform some maintenance work or gather information, and
then detach.

## Configuration {#nvme_multi_process_configuration}

DPDK EAL allows different types of processes to be spawned, each with different permissions
on the hugepage memory used by the applications.

There are two types of processes:
1. a primary process which initializes the shared memory and has full privileges, and
2. a secondary process which can attach to the primary process by mapping its shared memory
regions and perform NVMe operations including creating queue pairs.

This feature is enabled by default and is controlled by selecting a value for the shared
memory group ID. This ID is a positive integer and two applications with the same shared
memory group ID will share memory. The first application with a given shared memory group
ID will be considered the primary and all others secondary.

Example: identical shm_id and non-overlapping core masks
~~~{.sh}
./perf options [AIO device(s)]...
[-c core mask for I/O submission/completion]
[-i shared memory group ID]

./perf -q 1 -o 4096 -w randread -c 0x1 -t 60 -i 1
./perf -q 8 -o 131072 -w write -c 0x10 -t 60 -i 1
~~~
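
Applications that initialize the SPDK environment library directly opt into the same
behavior by choosing a shared memory group ID at startup. A minimal sketch, using an
arbitrary application name and the same shm_id/core-mask convention as the perf
example above:

~~~{.c}
#include "spdk/env.h"

static int
init_shared_env(void)
{
	struct spdk_env_opts opts;

	spdk_env_opts_init(&opts);
	opts.name = "multi_process_sketch";	/* arbitrary application name */
	opts.shm_id = 1;		/* processes with the same shm_id share memory */
	opts.core_mask = "0x1";		/* must not overlap the other process's mask */

	/* The first process to initialize with this shm_id becomes the primary. */
	return spdk_env_init(&opts);
}
~~~
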

## Limitations {#nvme_multi_process_limitations}

1. Two processes sharing memory may not share any cores in their core mask.
2. If a primary process exits while secondary processes are still running, those processes
will continue to run. However, a new primary process cannot be created.
3. Applications are responsible for coordinating access to logical blocks.
4. If a process exits unexpectedly, the allocated memory will be released when the last
process exits.

@sa spdk_nvme_probe, spdk_nvme_ctrlr_process_admin_completions

# NVMe Hotplug {#nvme_hotplug}

At the NVMe driver level, we provide the following support for Hotplug:

1. Hotplug event detection:
The user of the NVMe library can call spdk_nvme_probe() periodically to detect
hotplug events, as sketched after this list. The probe_cb, followed by the attach_cb,
will be called for each new device detected. The user may optionally also provide a
remove_cb that will be called if a previously attached NVMe device is no longer
present on the system. All subsequent I/O to the removed device will return an error.

2. Hot remove NVMe with I/O loads:
When a device is hot removed while I/O is occurring, all access to the PCI BAR will
result in a SIGBUS error. The NVMe driver automatically handles this case by installing
a SIGBUS handler and remapping the PCI BAR to a new, placeholder memory location.
This means I/O in flight during a hot remove will complete with an appropriate error
code and will not crash the application.

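
A minimal sketch of the periodic detection loop described in item 1, assuming
placeholder probe/attach callbacks like those in hello_world and an arbitrary
one-second polling interval:

~~~{.c}
#include <stdbool.h>
#include <unistd.h>
#include "spdk/nvme.h"

static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	return true;	/* attach to any newly inserted controller */
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	/* Start submitting I/O to the new controller here. */
}

static void
remove_cb(void *cb_ctx, struct spdk_nvme_ctrlr *ctrlr)
{
	/* The device is gone and outstanding I/O will complete with errors.
	 * A real application would record ctrlr here and call
	 * spdk_nvme_detach() on it later, outside of spdk_nvme_probe(). */
}

static void
hotplug_poll_loop(void)
{
	for (;;) {
		/* Re-probing reports new devices via probe_cb/attach_cb and
		 * removed devices via remove_cb; already attached controllers
		 * are skipped. */
		spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, remove_cb);
		sleep(1);
	}
}
~~~
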
@sa spdk_nvme_probe