==============
NVMe Emulation
==============

QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.

See the following sections for specific information on:

* `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
* Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
  `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
  Protection`_.

Adding NVMe Devices
===================

Controller Emulation
--------------------

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:

* Accounting numbers in the SMART/Health log page are reset when the device
  is power cycled.
* Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the
following parameters:

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.

``max_ioqpairs=UINT32`` (default: ``64``)
  Set the maximum number of allowed I/O queue pairs. This replaces the
  deprecated ``num_queues`` parameter.

``msix_qsize=UINT16`` (default: ``65``)
  The number of MSI-X vectors that the device should support.

``mdts=UINT8`` (default: ``7``)
  Set the Maximum Data Transfer Size of the device. The value is specified as
  a power of two (2^n) in units of the minimum memory page size (CAP.MPSMIN).

``use-intel-id`` (default: ``off``)
  Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
  Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
  previously used.

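As a rough illustration of how these parameters combine, the following sketch
attaches a controller with fewer I/O queue pairs and a smaller maximum transfer
size. The values are arbitrary and chosen only for this example.
``msix_qsize`` is set to one more than ``max_ioqpairs`` simply to mirror the
relationship between the two defaults; whether that relationship is required
is an assumption, not something stated above.

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm,max_ioqpairs=8,msix_qsize=9,mdts=5

Assuming a minimum memory page size of 4 KiB, ``mdts=5`` would correspond to a
maximum transfer size of 2^5 pages, i.e. 128 KiB.
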
Additional Namespaces
---------------------

In the simplest possible invocation sketched above, the device only supports a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.

.. code-block:: console

    -device nvme,id=nvme-ctrl-0,serial=deadbeef
    -drive file=nvm-1.img,if=none,id=nvm-1
    -device nvme-ns,drive=nvm-1
    -drive file=nvm-2.img,if=none,id=nvm-2
    -device nvme-ns,drive=nvm-2

The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.

There are a number of parameters available:

``nsid`` (default: ``0``)
  Explicitly set the namespace identifier. The default (``0``) leaves the
  identifier to be allocated automatically.

``uuid`` (default: *autogenerated*)
  Set the UUID of the namespace. This will be reported as a "Namespace UUID"
  descriptor in the Namespace Identification Descriptor List.

``eui64``
  Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
  Unique Identifier" descriptor in the Namespace Identification Descriptor List.
  Since machine type 6.1, a non-zero default value is used if the parameter
  is not provided. For earlier machine types, the field defaults to 0.

``bus``
  If more than one ``nvme`` device is defined, this parameter may be used to
  attach the namespace to a specific ``nvme`` device (identified by an ``id``
  parameter on the controller device).

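The following sketch shows how ``bus``, ``nsid`` and ``uuid`` might be used
together when two controllers are defined. The second controller's ``id`` and
``serial``, the namespace identifiers and the UUID are made-up values for
illustration only.

.. code-block:: console

    -device nvme,id=nvme-ctrl-0,serial=deadbeef
    -device nvme,id=nvme-ctrl-1,serial=cafebabe
    -drive file=nvm-1.img,if=none,id=nvm-1
    -device nvme-ns,drive=nvm-1,bus=nvme-ctrl-0,nsid=1
    -drive file=nvm-2.img,if=none,id=nvm-2
    -device nvme-ns,drive=nvm-2,bus=nvme-ctrl-1,nsid=5,uuid=00112233-4455-6677-8899-aabbccddeeff

Without the ``bus`` parameter, both namespaces would attach to
``nvme-ctrl-1``, since it is the most recently defined controller.
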
NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.

.. code-block:: console

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
    -device nvme,serial=deadbeef,subsys=nvme-subsys-0
    -device nvme,serial=deadbeef,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
  Specifies that the namespace will be attached to all controllers in the
  subsystem. If set to ``off``, the namespace will remain a private namespace
  and may only be attached to a single controller at a time. Shared namespaces
  are always automatically attached to all controllers (also when controllers
  are hotplugged).

``detached`` (default: ``off``)
  If set to ``on``, the namespace will be available in the subsystem, but
  not attached to any controllers initially. A shared namespace with this set
  to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

    -drive file=nvm-1.img,if=none,id=nvm-1
    -device nvme-ns,drive=nvm-1,nsid=1
    -drive file=nvm-2.img,if=none,id=nvm-2
    -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both
controllers. NSID 3 will be a private namespace due to ``shared=off`` and only
attachable to a single controller at a time. Additionally, it will not be
attached to any controller initially (due to ``detached=on``) or to hotplugged
controllers.

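For contrast, a private namespace that is attached from the start might be
sketched as follows. The controller ``id`` values are hypothetical, and the
example assumes that the ``bus`` parameter described under `Additional
Namespaces`_ also selects the controller when the controllers are linked to a
subsystem.

.. code-block:: console

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
    -device nvme,id=nvme-ctrl-0,serial=deadbeef,subsys=nvme-subsys-0
    -device nvme,id=nvme-ctrl-1,serial=deadbeef,subsys=nvme-subsys-0
    -drive file=nvm-1.img,if=none,id=nvm-1
    -device nvme-ns,drive=nvm-1,nsid=1
    -drive file=nvm-2.img,if=none,id=nvm-2
    -device nvme-ns,drive=nvm-2,nsid=2,shared=off,bus=nvme-ctrl-0

Under that assumption, NSID 1 is shared by both controllers, while NSID 2 is
private and initially attached to ``nvme-ctrl-0`` only.
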
Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
  This adds a Controller Memory Buffer of the given size at offset zero in BAR
  2.

``legacy-cmb`` (default: ``off``)
  By default, the device uses the "v1.4 scheme" for the Controller Memory
  Buffer support (i.e., the CMB is initially disabled and must be explicitly
  enabled by the host). Set this to ``on`` to behave as a v1.3 device with
  respect to the CMB.

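A minimal sketch of a controller with a Controller Memory Buffer could look
like this. The 64 MiB size is an arbitrary example value.

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm,cmb_size_mb=64

Appending ``legacy-cmb=on`` to the ``nvme`` device would select the v1.3
behavior described above instead.
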
Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy
command limits:

``mssrl=UINT16`` (default: ``128``)
  Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
  number of logical blocks that may be specified in each source range.

``mcl=UINT32`` (default: ``128``)
  Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
  blocks that may be specified in a Copy command (the total for all source
  ranges).

``msrc=UINT8`` (default: ``127``)
  Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
  source ranges that may be used in a Copy command. This is a 0's based value.

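To illustrate how the three limits relate, the sketch below uses arbitrary
example values chosen so that the limits are mutually consistent.

.. code-block:: console

    -device nvme,id=nvme-ctrl-0,serial=deadbeef
    -drive file=nvm-1.img,if=none,id=nvm-1
    -device nvme-ns,drive=nvm-1,mssrl=64,mcl=256,msrc=3

Because ``msrc`` is a 0's based value, ``msrc=3`` allows up to four source
ranges. Four ranges of at most 64 logical blocks each add up to 256 logical
blocks, which matches the ``mcl`` value.
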
Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.

The namespace may be configured with additional parameters:

``zoned.zone_size=SIZE`` (default: ``128MiB``)
  Define the zone size (``ZSZE``).

``zoned.zone_capacity=SIZE`` (default: ``0``)
  Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone
  capacity will equal the zone size.

``zoned.descr_ext_size=UINT32`` (default: ``0``)
  Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
  bytes.

``zoned.cross_read=BOOL`` (default: ``off``)
  Set to ``on`` to allow reads to cross zone boundaries.

``zoned.max_active=UINT32`` (default: ``0``)
  Set the maximum number of active resources (``MAR``). The default (``0``)
  allows all zones to be active.

``zoned.max_open=UINT32`` (default: ``0``)
  Set the maximum number of open resources (``MOR``). The default (``0``)
  allows all zones to be open. If ``zoned.max_active`` is specified, this value
  must be less than or equal to that.

``zoned.zasl=UINT8`` (default: ``0``)
  Set the maximum data transfer size for the Zone Append command. Like
  ``mdts``, the value is specified as a power of two (2^n) and is in units of
  the minimum memory page size (CAP.MPSMIN). The default value (``0``)
  has this property inherit the ``mdts`` value.

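A zoned namespace might be configured along the following lines. The image
name ``zns.img`` and all sizes and limits are example values only, and the
sketch assumes the backing image is large enough to hold the requested zones.

.. code-block:: console

    -device nvme,id=nvme-ctrl-0,serial=deadbeef
    -drive file=zns.img,if=none,id=zns
    -device nvme-ns,drive=zns,zoned=on,zoned.zone_size=64M,zoned.zone_capacity=48M,zoned.max_open=16,zoned.max_active=32

Here each 64 MiB zone exposes 48 MiB of writable capacity, and
``zoned.max_open`` is kept below ``zoned.max_active`` as required.
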
Flexible Data Placement
-----------------------

The device may be configured to support TP 4146 ("Flexible Data Placement") by
configuring it (``fdp=on``) on the subsystem::

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16

The subsystem emulates a single Endurance Group, on which Flexible Data
Placement will be supported. Also note that the device emulation deviates
slightly from the specification, by always enabling the "FDP Mode" feature on
the controller if the subsystem is configured for Flexible Data Placement.

Enabling Flexible Data Placement on the subsystem enables the following
parameters:

``fdp.nrg`` (default: ``1``)
  Set the number of Reclaim Groups.

``fdp.nruh`` (default: ``0``)
  Set the number of Reclaim Unit Handles. This is a mandatory parameter and
  must be non-zero.

``fdp.runs`` (default: ``96M``)
  Set the Reclaim Unit Nominal Size. Defaults to 96 MiB.

Namespaces within this subsystem may request Reclaim Unit Handles::

    -device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST

The ``RUHLIST`` is a semicolon separated list (e.g. ``0;1;2;3``) and may
include ranges (e.g. ``0;8-15``). If no reclaim unit handle list is specified,
the controller will assign the controller-specified reclaim unit handle to
placement handle identifier 0.

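Putting the pieces together, a subsystem with Flexible Data Placement and one
namespace requesting a subset of the handles could be sketched as below. The
drive and image names follow the earlier examples and are otherwise arbitrary.

.. code-block:: console

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16
    -device nvme,serial=deadbeef,subsys=nvme-subsys-0
    -drive file=nvm-1.img,if=none,id=nvm-1
    -device nvme-ns,drive=nvm-1,fdp.ruhs="0;8-15"

The list is quoted because most shells treat an unquoted ``;`` as a command
separator. With ``fdp.nruh=16`` the valid handles are 0 through 15, so this
list requests nine of them.
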
Metadata
--------

The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.

``ms=UINT16`` (default: ``0``)
  Defines the number of metadata bytes per LBA.

``mset=UINT8`` (default: ``0``)
  Set to ``1`` to enable extended LBAs.

End-to-End Data Protection
--------------------------

The virtual namespace device supports DIF- and DIX-based protection information
(depending on ``mset``).

``pi=UINT8`` (default: ``0``)
  Enable protection information of the specified type (type ``1``, ``2`` or
  ``3``).

``pil=UINT8`` (default: ``0``)
  Controls the location of the protection information within the metadata. Set
  to ``1`` to transfer protection information as the first eight bytes of
  metadata. Otherwise, the protection information is transferred as the last
  eight bytes.

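The metadata and protection information parameters are typically used
together. The following sketch assumes that 16 bytes of metadata per LBA is an
accepted value. It is chosen here only so that the ``pil`` setting makes a
visible difference.

.. code-block:: console

    -device nvme,id=nvme-ctrl-0,serial=deadbeef
    -drive file=nvm-1.img,if=none,id=nvm-1
    -device nvme-ns,drive=nvm-1,ms=16,mset=1,pi=1,pil=1

This requests extended LBAs (``mset=1``) with Type 1 protection information
carried in the first eight of the sixteen metadata bytes (``pil=1``).
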
Virtualization Enhancements and SR-IOV (Experimental Support)
-------------------------------------------------------------

The ``nvme`` device supports Single Root I/O Virtualization and Sharing
along with Virtualization Enhancements. The controller has to be linked to
an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.

A number of parameters are present (**please note that they may be
subject to change**):

``sriov_max_vfs`` (default: ``0``)
  Indicates the maximum number of PCIe virtual functions supported
  by the controller. Specifying a non-zero value enables reporting of both
  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
  by the NVMe device. Virtual function controllers will not report SR-IOV.

``sriov_vq_flexible``
  Indicates the total number of flexible queue resources assignable to all
  the secondary controllers. Implicitly sets the number of the primary
  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.

``sriov_vi_flexible``
  Indicates the total number of flexible interrupt resources assignable to
  all the secondary controllers. Implicitly sets the number of the primary
  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.

``sriov_max_vi_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual interrupt resources assignable
  to a secondary controller. The default ``0`` resolves to
  ``(sriov_vi_flexible / sriov_max_vfs)``.

``sriov_max_vq_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual queue resources assignable to
  a secondary controller. The default ``0`` resolves to
  ``(sriov_vq_flexible / sriov_max_vfs)``.

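As a worked example of the formulas above, consider the following sketch with
four virtual functions. The flexible resource counts are arbitrary example
values, and ``max_ioqpairs`` and ``msix_qsize`` are spelled out at their
defaults only to make the arithmetic visible.

.. code-block:: console

    -device nvme-subsys,id=subsys0
    -device nvme,serial=deadbeef,subsys=subsys0,max_ioqpairs=64,msix_qsize=65,
     sriov_max_vfs=4,sriov_vq_flexible=8,sriov_vi_flexible=4

With these values the primary controller keeps ``64 - 8 = 56`` private queue
resources and ``65 - 4 = 61`` private interrupt resources. Since the per-VF
limits are left at ``0``, each secondary controller may be assigned at most
``8 / 4 = 2`` queue resources and ``4 / 4 = 1`` interrupt resource.
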
The simplest possible invocation enables the capability to set up one VF
controller and assign an admin queue, an I/O queue, and an MSI-X interrupt.

.. code-block:: console

    -device nvme-subsys,id=subsys0
    -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,
     sriov_vq_flexible=2,sriov_vi_flexible=1

The minimum steps required to configure a functional NVMe secondary
controller are:

* Unbind flexible resources from the primary controller.

  .. code-block:: console

      nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
      nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

* Perform a Function Level Reset on the primary controller to actually
  release the resources.

  .. code-block:: console

      echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

* Enable the VF.

  .. code-block:: console

      echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

* Assign the flexible resources to the VF and set it ONLINE.

  .. code-block:: console

      nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
      nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
      nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

* Bind the NVMe driver to the VF.

  .. code-block:: console

      echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind