Commit | Line | Data |
---|---|---|
b0a4aa95 MCC |
1 | =============================== |
2 | LIBNVDIMM: Non-Volatile Devices | |
3 | =============================== | |
bc30196f | 4 | |
b0a4aa95 MCC |
5 | libnvdimm - kernel / libndctl - userspace helper library |
6 | ||
7 | linux-nvdimm@lists.01.org | |
8 | ||
9 | Version 13 | |
10 | ||
11 | .. contents: | |
bc30196f DW |
12 | |
13 | Glossary | |
14 | Overview | |
15 | Supporting Documents | |
16 | Git Trees | |
17 | LIBNVDIMM PMEM and BLK | |
18 | Why BLK? | |
19 | PMEM vs BLK | |
20 | BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX | |
21 | Example NVDIMM Platform | |
22 | LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API | |
23 | LIBNDCTL: Context | |
24 | libndctl: instantiate a new library context example | |
25 | LIBNVDIMM/LIBNDCTL: Bus | |
26 | libnvdimm: control class device in /sys/class | |
27 | libnvdimm: bus | |
28 | libndctl: bus enumeration example | |
29 | LIBNVDIMM/LIBNDCTL: DIMM (NMEM) | |
30 | libnvdimm: DIMM (NMEM) | |
31 | libndctl: DIMM enumeration example | |
32 | LIBNVDIMM/LIBNDCTL: Region | |
33 | libnvdimm: region | |
34 | libndctl: region enumeration example | |
35 | Why Not Encode the Region Type into the Region Name? | |
36 | How Do I Determine the Major Type of a Region? | |
37 | LIBNVDIMM/LIBNDCTL: Namespace | |
38 | libnvdimm: namespace | |
39 | libndctl: namespace enumeration example | |
40 | libndctl: namespace creation example | |
41 | Why the Term "namespace"? | |
42 | LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" | |
43 | libnvdimm: btt layout | |
44 | libndctl: btt creation example | |
45 | Summary LIBNDCTL Diagram | |
46 | ||
47 | ||
48 | Glossary | |
b0a4aa95 MCC |
49 | ======== |
50 | ||
51 | PMEM: | |
52 | A system-physical-address range where writes are persistent. A | |
53 | block device composed of PMEM is capable of DAX. A PMEM address range | |
54 | may span an interleave of several DIMMs. | |
55 | ||
56 | BLK: | |
57 | A set of one or more programmable memory mapped apertures provided | |
58 | by a DIMM to access its media. This indirection precludes the | |
59 | performance benefit of interleaving, but enables DIMM-bounded failure | |
60 | modes. | |
61 | ||
62 | DPA: | |
63 | DIMM Physical Address: a DIMM-relative offset. With one DIMM in | |
64 | the system there would be a 1:1 system-physical-address:DPA association. | |
65 | Once more DIMMs are added a memory controller interleave must be | |
66 | decoded to determine the DPA associated with a given | |
67 | system-physical-address. BLK capacity always has a 1:1 relationship | |
68 | with a single-DIMM's DPA range. | |
69 | ||
70 | DAX: | |
71 | File system extensions to bypass the page cache and block layer to | |
72 | mmap persistent memory, from a PMEM block device, directly into a | |
73 | process address space. | |
74 | ||
75 | DSM: | |
76 | Device Specific Method: ACPI method to control a specific |
77 | device - in this case the firmware. | |
78 | ||
79 | DCR: | |
80 | NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5. | |
81 | It defines a vendor-id, device-id, and interface format for a given DIMM. | |
82 | ||
83 | BTT: | |
84 | Block Translation Table: Persistent memory is byte addressable. | |
85 | Existing software may have an expectation that the power-fail-atomicity | |
86 | of writes is at least one sector, 512 bytes. The BTT is an indirection | |
87 | table with atomic update semantics to front a PMEM/BLK block device | |
88 | driver and present arbitrary atomic sector sizes. | |
89 | ||
90 | LABEL: | |
91 | Metadata stored on a DIMM device that partitions and identifies | |
92 | (persistently names) storage between PMEM and BLK. It also partitions | |
93 | BLK storage to host BTTs with different parameters per BLK-partition. | |
94 | Note that traditional partition tables, GPT/MBR, are layered on top of a | |
95 | BLK or PMEM device. | |
bc30196f DW |
96 | |
97 | ||
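The DPA entry in the glossary says that a memory controller interleave must be decoded to find the DPA behind a system-physical-address. As a rough illustration only, the sketch below assumes a plain round-robin stripe with a fixed granularity. The `interleave_set` and `dimm_addr` structures and the `spa_to_dpa()` helper are hypothetical names invented for this example; real platforms may use different, hardware-specific interleave schemes.

```c
#include <stdint.h>

/* Hypothetical description of one interleave set; not a kernel structure. */
struct interleave_set {
	uint64_t spa_base;	/* first system-physical-address of the set */
	uint64_t granularity;	/* bytes placed on one DIMM before moving on */
	unsigned int ways;	/* number of DIMMs in the set */
};

/* Hypothetical result type: which DIMM, and the offset into its share. */
struct dimm_addr {
	unsigned int dimm;	/* index of the DIMM within the set */
	uint64_t dpa;		/* DIMM-relative offset within the set's share */
};

/* Decode an address assuming simple round-robin striping (an assumption). */
static struct dimm_addr spa_to_dpa(const struct interleave_set *set,
				   uint64_t spa)
{
	uint64_t offset = spa - set->spa_base;
	uint64_t line = offset / set->granularity;
	struct dimm_addr out;

	out.dimm = (unsigned int)(line % set->ways);
	out.dpa = (line / set->ways) * set->granularity
		+ offset % set->granularity;
	return out;
}
```

With `ways` set to 1 the helper returns the offset unchanged, which matches the 1:1 association described for a single-DIMM system.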
98 | Overview | |
b0a4aa95 | 99 | ======== |
bc30196f DW |
100 | |
101 | The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely, | |
102 | PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM | |
103 | and BLK mode access. These three modes of operation are described by | |
104 | the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM | |
105 | implementation is generic and supports pre-NFIT platforms, it was guided | |
106 | by the superset of capabilities needed to support this ACPI 6 definition | |
107 | for NVDIMM resources. The bulk of the kernel implementation is in place | |
108 | to handle the case where DPA accessible via PMEM is aliased with DPA | |
109 | accessible via BLK. When that occurs a LABEL is needed to reserve DPA | |
110 | for exclusive access via one mode at a time. | |
111 | ||
112 | Supporting Documents | |
b0a4aa95 MCC |
113 | -------------------- |
114 | ||
115 | ACPI 6: | |
116 | http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf | |
117 | NVDIMM Namespace: | |
118 | http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf | |
119 | DSM Interface Example: | |
120 | http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf | |
121 | Driver Writer's Guide: | |
122 | http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf | |
bc30196f DW |
123 | |
124 | Git Trees | |
b0a4aa95 MCC |
125 | --------- |
126 | ||
127 | LIBNVDIMM: | |
128 | https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git | |
129 | LIBNDCTL: | |
130 | https://github.com/pmem/ndctl.git | |
131 | PMEM: | |
132 | https://github.com/01org/prd | |
bc30196f DW |
133 | |
134 | ||
135 | LIBNVDIMM PMEM and BLK | |
b0a4aa95 | 136 | ====================== |
bc30196f DW |
137 | |
138 | Prior to the arrival of the NFIT, non-volatile memory was described to a | |
139 | system in various ad-hoc ways. Usually only the bare minimum was | |
140 | provided, namely, a single system-physical-address range where writes | |
141 | are expected to be durable after a system power loss. Now, the NFIT | |
142 | specification standardizes not only the description of PMEM, but also | |
143 | BLK and platform message-passing entry points for control and | |
144 | configuration. | |
145 | ||
146 | For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block | |
147 | device driver: | |
148 | ||
149 | 1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This | |
b0a4aa95 MCC |
150 | range is contiguous in system memory and may be interleaved (hardware |
151 | memory controller striped) across multiple DIMMs. When interleaved the | |
152 | platform may optionally provide details of which DIMMs are participating | |
153 | in the interleave. | |
154 | ||
155 | Note that while LIBNVDIMM describes system-physical-address ranges that may | |
156 | alias with BLK access as ND_NAMESPACE_PMEM ranges and those without | |
157 | alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no | |
158 | distinction. The different device-types are an implementation detail | |
159 | that userspace can exploit to implement policies like "only interface | |
160 | with address ranges from certain DIMMs". It is worth noting that when | |
161 | aliasing is present and a DIMM lacks a label, then no block device can | |
162 | be created by default as userspace needs to do at least one allocation | |
163 | of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once | |
164 | registered, can be immediately attached to nd_pmem. | |
bc30196f DW |
165 | |
166 | 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform | |
b0a4aa95 MCC |
167 | defined apertures. A set of apertures will access just one DIMM. |
168 | Multiple windows (apertures) allow multiple concurrent accesses, much like | |
169 | tagged-command-queuing, and would likely be used by different threads or | |
170 | different CPUs. | |
171 | ||
172 | The NFIT specification defines a standard format for a BLK-aperture, but | |
173 | the spec also allows for vendor specific layouts, and non-NFIT BLK | |
174 | implementations may have other designs for BLK I/O. For this reason | |
175 | "nd_blk" calls back into platform-specific code to perform the I/O. | |
bc30196f | 176 | |
b0a4aa95 MCC |
177 | One such implementation is defined in the "Driver Writer's Guide" and "DSM |
178 | Interface Example". | |
bc30196f DW |
179 | |
180 | ||
181 | Why BLK? | |
b0a4aa95 | 182 | ======== |
bc30196f DW |
183 | |
184 | While PMEM provides direct byte-addressable CPU-load/store access to | |
185 | NVDIMM storage, it does not provide the best system RAS (reliability, | |
186 | availability, and serviceability) model. An access to a corrupted | |
8de5dff8 | 187 | system-physical-address causes a CPU exception, while an access
bc30196f DW |
188 | to a corrupted address through a BLK-aperture causes that block window
189 | to raise an error status in a register. The latter is more aligned with | |
190 | the standard error model that host-bus-adapter attached disks present. | |
b0a4aa95 | 191 | |
bc30196f DW |
192 | Also, if an administrator ever wants to replace a memory module, it is easier to
193 | service a system at DIMM module boundaries. Compare this to PMEM where | |
194 | data could be interleaved in an opaque hardware specific manner across | |
195 | several DIMMs. | |
196 | ||
197 | PMEM vs BLK | |
b0a4aa95 MCC |
198 | ----------- |
199 | ||
8de5dff8 | 200 | BLK-apertures solve these RAS problems, but their presence is also the |
bc30196f DW |
201 | major contributing factor to the complexity of the ND subsystem. They |
202 | complicate the implementation because PMEM and BLK alias in DPA space. | |
203 | Any given DIMM's DPA-range may contribute to one or more | |
204 | system-physical-address sets of interleaved DIMMs, *and* may also be | |
205 | accessed in its entirety through its BLK-aperture. Accessing a DPA | |
206 | through a system-physical-address while simultaneously accessing the | |
207 | same DPA through a BLK-aperture has undefined results. For this reason, | |
208 | DIMMs with this dual interface configuration include a DSM function to | |
209 | store/retrieve a LABEL. The LABEL effectively partitions the DPA-space | |
210 | into exclusive system-physical-address and BLK-aperture accessible | |
211 | regions. For simplicity, a DIMM is allowed one PMEM "region" for each | |
212 | interleave set in which it is a member. The remaining DPA space can be | |
213 | carved into an arbitrary number of BLK devices with discontiguous | |
214 | extents. | |
215 | ||
216 | BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX | |
b0a4aa95 | 217 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
bc30196f DW |
218 | |
219 | One of the few | |
220 | reasons to allow multiple BLK namespaces per REGION is so that each | |
221 | BLK-namespace can be configured with a BTT with unique atomic sector | |
222 | sizes. While a PMEM device can host a BTT the LABEL specification does | |
223 | not provide for a sector size to be specified for a PMEM namespace. | |
b0a4aa95 | 224 | |
bc30196f DW |
225 | This is due to the expectation that the primary usage model for PMEM is |
226 | via DAX, and the BTT is incompatible with DAX. However, for the cases | |
227 | where an application or filesystem still needs atomic sector update | |
228 | guarantees it can register a BTT on a PMEM device or partition. See | |
229 | LIBNVDIMM/LIBNDCTL: Block Translation Table "btt". | |
230 | ||
231 | ||
232 | Example NVDIMM Platform | |
b0a4aa95 | 233 | ======================= |
bc30196f DW |
234 | |
235 | For the remainder of this document the following diagram will be | |
b0a4aa95 MCC |
236 | referenced for any example sysfs layouts:: |
237 | ||
238 | ||
239 | (a) (b) DIMM BLK-REGION | |
240 | +-------------------+--------+--------+--------+ | |
241 | +------+ | pm0.0 | blk2.0 | pm1.0 | blk2.1 | 0 region2 | |
242 | | imc0 +--+- - - region0- - - +--------+ +--------+ | |
243 | +--+---+ | pm0.0 | blk3.0 | pm1.0 | blk3.1 | 1 region3 | |
244 | | +-------------------+--------v v--------+ | |
245 | +--+---+ | | | |
246 | | cpu0 | region1 | |
247 | +--+---+ | | | |
248 | | +----------------------------^ ^--------+ | |
249 | +--+---+ | blk4.0 | pm1.0 | blk4.0 | 2 region4 | |
250 | | imc1 +--+----------------------------| +--------+ | |
251 | +------+ | blk5.0 | pm1.0 | blk5.0 | 3 region5 | |
252 | +----------------------------+--------+--------+ | |
bc30196f DW |
253 | |
254 | In this platform we have four DIMMs and two memory controllers in one | |
255 | socket. Each unique interface (BLK or PMEM) to DPA space is identified | |
256 | by a region device with a dynamically assigned id (REGION0 - REGION5). | |
257 | ||
258 | 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A | |
b0a4aa95 MCC |
259 | single PMEM namespace is created in the REGION0-SPA-range that spans most |
260 | of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that | |
261 | interleaved system-physical-address range is reclaimed as BLK-aperture | |
262 | accessed space starting at DPA-offset (a) into each DIMM. In that | |
263 | reclaimed space we create two BLK-aperture "namespaces" from REGION2 and | |
264 | REGION3 where "blk2.0" and "blk3.0" are just human readable names that | |
265 | could be set to any user-desired name in the LABEL. | |
bc30196f DW |
266 | |
267 | 2. In the last portion of DIMM0 and DIMM1 we have an interleaved | |
b0a4aa95 MCC |
268 | system-physical-address range, REGION1, that spans those two DIMMs as |
269 | well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace | |
270 | named "pm1.0"; the rest is reclaimed in 4 BLK-aperture namespaces (one for | |
271 | each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and | |
272 | "blk5.0". | |
bc30196f DW |
273 | |
274 | 3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1 | |
b0a4aa95 MCC |
275 | interleaved system-physical-address range (i.e. the DPA addresses past
276 | offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces. | |
277 | Note that this example shows that BLK-aperture namespaces don't need to | |
278 | be contiguous in DPA-space. | |
bc30196f DW |
279 | |
280 | This bus is provided by the kernel under the device | |
281 | /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and | |
282 | the nfit_test.ko module is loaded. This not only tests LIBNVDIMM but the | |
283 | acpi_nfit.ko driver as well. | |
284 | ||
285 | ||
286 | LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API | |
b0a4aa95 | 287 | ======================================================== |
bc30196f DW |
288 | |
289 | What follows is a description of the LIBNVDIMM sysfs layout and a | |
290 | corresponding object hierarchy diagram as viewed through the LIBNDCTL | |
8de5dff8 | 291 | API. The example sysfs paths and diagrams are relative to the Example |
bc30196f DW |
292 | NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit |
293 | test. | |
294 | ||
295 | LIBNDCTL: Context | |
b0a4aa95 MCC |
296 | ----------------- |
297 | ||
8de5dff8 | 298 | Every API call in the LIBNDCTL library requires a context that holds the |
bc30196f DW |
299 | logging parameters and other library instance state. The library is |
300 | based on the libabc template: | |
b0a4aa95 MCC |
301 | |
302 | https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git | |
bc30196f DW |
303 | |
304 | LIBNDCTL: instantiate a new library context example | |
b0a4aa95 MCC |
305 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
306 | ||
307 | :: | |
bc30196f DW |
308 | |
309 | struct ndctl_ctx *ctx; | |
310 | ||
311 | if (ndctl_new(&ctx) == 0) | |
312 | return ctx; | |
313 | else | |
314 | return NULL; | |
315 | ||
316 | LIBNVDIMM/LIBNDCTL: Bus | |
b0a4aa95 | 317 | ----------------------- |
bc30196f DW |
318 | |
319 | A bus has a 1:1 relationship with an NFIT. The current expectation for | |
320 | ACPI based systems is that there is only ever one platform-global NFIT. | |
321 | That said, it is trivial to register multiple NFITs; the specification | |
322 | does not preclude it. The infrastructure supports multiple busses and | |
3d9cf48b SR |
323 | we use this capability to test multiple NFIT configurations in the unit |
324 | test. | |
bc30196f DW |
325 | |
326 | LIBNVDIMM: control class device in /sys/class | |
b0a4aa95 | 327 | --------------------------------------------- |
bc30196f DW |
328 | |
329 | This character device accepts DSM messages to be passed to a DIMM | |
b0a4aa95 | 330 | identified by its NFIT handle:: |
bc30196f DW |
331 | |
332 | /sys/class/nd/ndctl0 | |
333 | |-- dev | |
334 | |-- device -> ../../../ndbus0 | |
335 | |-- subsystem -> ../../../../../../../class/nd | |
336 | ||
337 | ||
338 | ||
339 | LIBNVDIMM: bus | |
b0a4aa95 MCC |
340 | -------------- |
341 | ||
342 | :: | |
bc30196f DW |
343 | |
344 | struct nvdimm_bus *nvdimm_bus_register(struct device *parent, | |
345 | struct nvdimm_bus_descriptor *nfit_desc); | |
346 | ||
b0a4aa95 MCC |
347 | :: |
348 | ||
bc30196f DW |
349 | /sys/devices/platform/nfit_test.0/ndbus0 |
350 | |-- commands | |
351 | |-- nd | |
352 | |-- nfit | |
353 | |-- nmem0 | |
354 | |-- nmem1 | |
355 | |-- nmem2 | |
356 | |-- nmem3 | |
357 | |-- power | |
358 | |-- provider | |
359 | |-- region0 | |
360 | |-- region1 | |
361 | |-- region2 | |
362 | |-- region3 | |
363 | |-- region4 | |
364 | |-- region5 | |
365 | |-- uevent | |
366 | `-- wait_probe | |
367 | ||
368 | LIBNDCTL: bus enumeration example | |
b0a4aa95 MCC |
369 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
370 | ||
371 | Find the bus handle that describes the bus from Example NVDIMM Platform:: | |
bc30196f DW |
372 | |
373 | static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx, | |
374 | const char *provider) | |
375 | { | |
376 | struct ndctl_bus *bus; | |
377 | ||
378 | ndctl_bus_foreach(ctx, bus) | |
379 | if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0) | |
380 | return bus; | |
381 | ||
382 | return NULL; | |
383 | } | |
384 | ||
385 | bus = get_bus_by_provider(ctx, "nfit_test.0"); | |
386 | ||
387 | ||
388 | LIBNVDIMM/LIBNDCTL: DIMM (NMEM) | |
b0a4aa95 | 389 | ------------------------------- |
bc30196f DW |
390 | |
391 | The DIMM device provides a character device for sending commands to | |
392 | hardware, and it is a container for LABELs. If the DIMM is defined by | |
393 | NFIT then an optional 'nfit' attribute sub-directory is available to add | |
394 | NFIT-specifics. | |
395 | ||
396 | Note that the kernel device name for "DIMMs" is "nmemX". The NFIT | |
397 | describes these devices via "Memory Device to System Physical Address | |
398 | Range Mapping Structure", and there is no requirement that they actually | |
399 | be physical DIMMs, so we use a more generic name. | |
400 | ||
401 | LIBNVDIMM: DIMM (NMEM) | |
b0a4aa95 MCC |
402 | ^^^^^^^^^^^^^^^^^^^^^^ |
403 | ||
404 | :: | |
bc30196f DW |
405 | |
406 | struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data, | |
407 | const struct attribute_group **groups, unsigned long flags, | |
408 | unsigned long *dsm_mask); | |
409 | ||
b0a4aa95 MCC |
410 | :: |
411 | ||
bc30196f DW |
412 | /sys/devices/platform/nfit_test.0/ndbus0 |
413 | |-- nmem0 | |
414 | | |-- available_slots | |
415 | | |-- commands | |
416 | | |-- dev | |
417 | | |-- devtype | |
418 | | |-- driver -> ../../../../../bus/nd/drivers/nvdimm | |
419 | | |-- modalias | |
420 | | |-- nfit | |
421 | | | |-- device | |
422 | | | |-- format | |
423 | | | |-- handle | |
424 | | | |-- phys_id | |
425 | | | |-- rev_id | |
426 | | | |-- serial | |
427 | | | `-- vendor | |
428 | | |-- state | |
429 | | |-- subsystem -> ../../../../../bus/nd | |
430 | | `-- uevent | |
431 | |-- nmem1 | |
432 | [..] | |
433 | ||
434 | ||
435 | LIBNDCTL: DIMM enumeration example | |
b0a4aa95 | 436 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
bc30196f DW |
437 | |
438 | Note, in this example we are assuming NFIT-defined DIMMs which are | |
439 | identified by an "nfit_handle", a 32-bit value where: | |
b0a4aa95 MCC |
440 | |
441 | - Bit 3:0 DIMM number within the memory channel | |
442 | - Bit 7:4 memory channel number | |
443 | - Bit 11:8 memory controller ID | |
444 | - Bit 15:12 socket ID (within scope of a Node controller if node | |
445 | controller is present) | |
446 | - Bit 27:16 Node Controller ID | |
447 | - Bit 31:28 Reserved | |
448 | ||
449 | :: | |
bc30196f DW |
450 | |
451 | static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus, | |
452 | unsigned int handle) | |
453 | { | |
454 | struct ndctl_dimm *dimm; | |
455 | ||
456 | ndctl_dimm_foreach(bus, dimm) | |
457 | if (ndctl_dimm_get_handle(dimm) == handle) | |
458 | return dimm; | |
459 | ||
460 | return NULL; | |
461 | } | |
462 | ||
463 | #define DIMM_HANDLE(n, s, i, c, d) \ | |
464 | (((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \ | |
465 | | ((c & 0xf) << 4) | (d & 0xf)) | |
466 | ||
467 | dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0)); | |
468 | ||
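The DIMM_HANDLE() macro above packs fields into an "nfit_handle". Going the other way can help when reading a handle out of sysfs or a log. The following is a minimal sketch that unpacks a handle using the bit layout listed in this section; the `nfit_handle_fields` structure and `decode_nfit_handle()` function are hypothetical names for this example and are not part of libndctl.

```c
#include <stdint.h>

/* Fields of an NFIT device handle, per the bit layout listed above. */
struct nfit_handle_fields {
	unsigned int node;		/* bits 27:16, node controller ID */
	unsigned int socket;		/* bits 15:12, socket ID */
	unsigned int controller;	/* bits 11:8, memory controller ID */
	unsigned int channel;		/* bits 7:4, memory channel number */
	unsigned int dimm;		/* bits 3:0, DIMM number in the channel */
};

/* Unpack a 32-bit handle; reserved bits 31:28 are ignored. */
static struct nfit_handle_fields decode_nfit_handle(uint32_t handle)
{
	struct nfit_handle_fields f;

	f.node = (handle >> 16) & 0xfff;
	f.socket = (handle >> 12) & 0xf;
	f.controller = (handle >> 8) & 0xf;
	f.channel = (handle >> 4) & 0xf;
	f.dimm = handle & 0xf;
	return f;
}
```

For instance, a handle of 0x01234567 would decode to node 0x123, socket 4, memory controller 5, channel 6, and DIMM 7.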
469 | LIBNVDIMM/LIBNDCTL: Region | |
b0a4aa95 | 470 | -------------------------- |
bc30196f | 471 | |
8de5dff8 | 472 | A generic REGION device is registered for each PMEM range or BLK-aperture |
bc30196f DW |
473 | set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture |
474 | sets on the "nfit_test.0" bus. The primary role of regions is to be a | |
475 | container of "mappings". A mapping is a tuple of <DIMM, | |
476 | DPA-start-offset, length>. | |
477 | ||
478 | LIBNVDIMM provides a built-in driver for these REGION devices. This driver | |
479 | is responsible for reconciling the aliased DPA mappings across all | |
480 | regions, parsing the LABEL, if present, and then emitting NAMESPACE | |
481 | devices with the resolved/exclusive DPA-boundaries for the nd_pmem or | |
482 | nd_blk device driver to consume. | |
483 | ||
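To make the idea of "aliased DPA mappings" concrete, the sketch below models a mapping as the <DIMM, DPA-start-offset, length> tuple described above and tests whether two mappings cover overlapping DPA on the same DIMM. The `mapping` structure and `mappings_alias()` helper are hypothetical and exist only for illustration; they are not the kernel's internal representation and do not reflect how the region driver actually reconciles aliases.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the <DIMM, DPA-start-offset, length> tuple. */
struct mapping {
	unsigned int dimm;	/* DIMM index, e.g. 0 for nmem0 */
	uint64_t dpa_start;	/* first DPA covered by the mapping */
	uint64_t length;	/* number of bytes covered */
};

/* Two mappings alias when they cover overlapping DPA on the same DIMM. */
static bool mappings_alias(const struct mapping *a, const struct mapping *b)
{
	if (a->dimm != b->dimm)
		return false;
	if (a->length == 0 || b->length == 0)
		return false;	/* an empty mapping covers no DPA */
	return a->dpa_start < b->dpa_start + b->length &&
	       b->dpa_start < a->dpa_start + a->length;
}
```

In the example platform, a PMEM mapping for DIMM0 and the BLK mapping that spans all of DIMM0 would alias under this check, which is the situation a LABEL is meant to resolve.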
484 | In addition to the generic attributes of "mapping"s, "interleave_ways" | |
485 | and "size", the REGION device also exports some convenience attributes. | |
486 | "nstype" indicates the integer type of namespace-device this region | |
487 | emits, "devtype" duplicates the DEVTYPE variable stored by udev at the | |
488 | 'add' event, "modalias" duplicates the MODALIAS variable stored by udev | |
489 | at the 'add' event, and finally, the optional "spa_index" is provided in | |
490 | the case where the region is defined by a SPA. | |
491 | ||
b0a4aa95 | 492 | LIBNVDIMM: region:: |
bc30196f DW |
493 | |
494 | struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus, | |
495 | struct nd_region_desc *ndr_desc); | |
496 | struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus, | |
497 | struct nd_region_desc *ndr_desc); | |
498 | ||
b0a4aa95 MCC |
499 | :: |
500 | ||
bc30196f DW |
501 | /sys/devices/platform/nfit_test.0/ndbus0 |
502 | |-- region0 | |
503 | | |-- available_size | |
504 | | |-- btt0 | |
505 | | |-- btt_seed | |
506 | | |-- devtype | |
507 | | |-- driver -> ../../../../../bus/nd/drivers/nd_region | |
508 | | |-- init_namespaces | |
509 | | |-- mapping0 | |
510 | | |-- mapping1 | |
511 | | |-- mappings | |
512 | | |-- modalias | |
513 | | |-- namespace0.0 | |
514 | | |-- namespace_seed | |
515 | | |-- numa_node | |
516 | | |-- nfit | |
517 | | | `-- spa_index | |
518 | | |-- nstype | |
519 | | |-- set_cookie | |
520 | | |-- size | |
521 | | |-- subsystem -> ../../../../../bus/nd | |
522 | | `-- uevent | |
523 | |-- region1 | |
524 | [..] | |
525 | ||
526 | LIBNDCTL: region enumeration example | |
b0a4aa95 | 527 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
bc30196f DW |
528 | |
529 | Sample region retrieval routines based on NFIT-unique data like | |
530 | "spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for | |
b0a4aa95 | 531 | BLK:: |
bc30196f DW |
532 | |
533 | static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus, | |
534 | unsigned int spa_index) | |
535 | { | |
536 | struct ndctl_region *region; | |
537 | ||
538 | ndctl_region_foreach(bus, region) { | |
539 | if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM) | |
540 | continue; | |
541 | if (ndctl_region_get_spa_index(region) == spa_index) | |
542 | return region; | |
543 | } | |
544 | return NULL; | |
545 | } | |
546 | ||
547 | static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus, | |
548 | unsigned int handle) | |
549 | { | |
550 | struct ndctl_region *region; | |
551 | ||
552 | ndctl_region_foreach(bus, region) { | |
553 | struct ndctl_mapping *map; | |
554 | ||
555 | if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLK) | |
556 | continue; | |
557 | ndctl_mapping_foreach(region, map) { | |
558 | struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map); | |
559 | ||
560 | if (ndctl_dimm_get_handle(dimm) == handle) | |
561 | return region; | |
562 | } | |
563 | } | |
564 | return NULL; | |
565 | } | |
566 | ||
567 | ||
568 | Why Not Encode the Region Type into the Region Name? | |
569 | ---------------------------------------------------- | |
570 | ||
571 | At first glance it seems that, since NFIT defines just PMEM and BLK interface | |
572 | types, we should simply name REGION devices with something derived | |
573 | from those type names. However, the ND subsystem explicitly keeps the | |
574 | REGION name generic and expects userspace to always consider the | |
8de5dff8 | 575 | region-attributes for four reasons: |
bc30196f DW |
576 | |
577 | 1. There are already more than two REGION and "namespace" types. For | |
b0a4aa95 MCC |
578 | PMEM there are two subtypes. As mentioned previously we have PMEM where |
579 | the constituent DIMM devices are known and anonymous PMEM. For BLK | |
580 | regions the NFIT specification already anticipates vendor specific | |
581 | implementations. The exact distinction of what a region contains is in | |
582 | the region-attributes not the region-name or the region-devtype. | |
bc30196f DW |
583 | |
584 | 2. A region with zero child-namespaces is a possible configuration. For | |
b0a4aa95 MCC |
585 | example, the NFIT allows for a DCR to be published without a |
586 | corresponding BLK-aperture. This equates to a DIMM that can only accept | |
587 | control/configuration messages, but no i/o through a descendant block | |
588 | device. Again, this "type" is advertised in the attributes ('mappings' | |
589 | == 0) and the name does not tell you much. | |
bc30196f DW |
590 | |
591 | 3. What if a third major interface type arises in the future? Outside | |
b0a4aa95 MCC |
592 | of vendor specific implementations, it's not difficult to envision a |
593 | third class of interface type beyond BLK and PMEM. With a generic name | |
594 | for the REGION level of the device-hierarchy old userspace | |
595 | implementations can still make sense of new kernel advertised | |
596 | region-types. Userspace can always rely on the generic region | |
597 | attributes like "mappings", "size", etc and the expected child devices | |
598 | named "namespace". This generic format of the device-model hierarchy | |
599 | allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and | |
600 | future-proof. | |
bc30196f DW |
601 | |
602 | 4. There are more robust mechanisms for determining the major type of a | |
b0a4aa95 MCC |
603 | region than a device name. See the next section, How Do I Determine the |
604 | Major Type of a Region? | |
bc30196f DW |
605 | |
606 | How Do I Determine the Major Type of a Region? | |
607 | ---------------------------------------------- | |
608 | ||
609 | Outside of the blanket recommendation of "use libndctl", or simply | |
610 | looking at the kernel header (/usr/include/linux/ndctl.h) to decode the | |
611 | "nstype" integer attribute, here are some other options. | |
612 | ||
b0a4aa95 MCC |
613 | 1. module alias lookup |
614 | ^^^^^^^^^^^^^^^^^^^^^^ | |
bc30196f DW |
615 | |
616 | The whole point of region/namespace device type differentiation is to | |
617 | decide which block-device driver will attach to a given LIBNVDIMM namespace. | |
618 | One can simply use the modalias to look up the resulting module. It's | |
619 | important to note that this method is robust in the presence of a | |
620 | vendor-specific driver down the road. If a vendor-specific | |
621 | implementation wants to supplant the standard nd_blk driver it can with | |
622 | minimal impact to the rest of LIBNVDIMM. | |
623 | ||
624 | In fact, a vendor may also want to have a vendor-specific region-driver | |
625 | (outside of nd_region). For example, if a vendor defined its own LABEL | |
626 | format it would need its own region driver to parse that LABEL and emit | |
627 | the resulting namespaces. The output from module resolution is more | |
628 | accurate than a region-name or region-devtype. | |
629 | ||
b0a4aa95 MCC |
630 | 2. udev |
631 | ^^^^^^^ | |
632 | ||
633 | The kernel "devtype" is registered in the udev database:: | |
bc30196f | 634 | |
b0a4aa95 MCC |
635 | # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0 |
636 | P: /devices/platform/nfit_test.0/ndbus0/region0 | |
637 | E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0 | |
638 | E: DEVTYPE=nd_pmem | |
639 | E: MODALIAS=nd:t2 | |
640 | E: SUBSYSTEM=nd | |
bc30196f | 641 | |
b0a4aa95 MCC |
642 | # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4 |
643 | P: /devices/platform/nfit_test.0/ndbus0/region4 | |
644 | E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4 | |
645 | E: DEVTYPE=nd_blk | |
646 | E: MODALIAS=nd:t3 | |
647 | E: SUBSYSTEM=nd | |
bc30196f DW |
648 | |
649 | ...and is available as a region attribute, but keep in mind that the | |
650 | "devtype" does not indicate sub-type variations and scripts should | |
651 | rely on the other attributes instead. | |
652 | ||
b0a4aa95 MCC |
653 | 3. type specific attributes |
654 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
bc30196f DW |
655 | |
656 | As it currently stands a BLK-aperture region will never have a | |
657 | "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A | |
658 | BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM | |
659 | that does not allow I/O. A PMEM region with a "mappings" value of zero | |
660 | is a simple system-physical-address range. | |
661 | ||
662 | ||
663 | LIBNVDIMM/LIBNDCTL: Namespace | |
b0a4aa95 | 664 | ----------------------------- |
bc30196f DW |
665 | |
666 | A REGION, after resolving DPA aliasing and LABEL specified boundaries, | |
667 | surfaces one or more "namespace" devices. The arrival of a "namespace" | |
668 | device currently triggers either the nd_blk or nd_pmem driver to load | |
669 | and register a disk/block device. | |
670 | ||
671 | LIBNVDIMM: namespace | |
b0a4aa95 MCC |
672 | ^^^^^^^^^^^^^^^^^^^^ |
673 | ||
bc30196f DW |
674 | Here is a sample layout from the three major types of NAMESPACE, where
675 | namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid' | |
676 | attribute), namespace2.0 represents a BLK namespace (note that it has a | |
677 | 'sector_size' attribute), and namespace6.0 represents an anonymous | |
678 | PMEM namespace (note that it has no 'uuid' attribute because it does not support a | |
b0a4aa95 | 679 | LABEL)::
bc30196f DW |
680 | |
681 | /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0 | |
682 | |-- alt_name | |
683 | |-- devtype | |
684 | |-- dpa_extents | |
685 | |-- force_raw | |
686 | |-- modalias | |
687 | |-- numa_node | |
688 | |-- resource | |
689 | |-- size | |
690 | |-- subsystem -> ../../../../../../bus/nd | |
691 | |-- type | |
692 | |-- uevent | |
693 | `-- uuid | |
694 | /sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0 | |
695 | |-- alt_name | |
696 | |-- devtype | |
697 | |-- dpa_extents | |
698 | |-- force_raw | |
699 | |-- modalias | |
700 | |-- numa_node | |
701 | |-- sector_size | |
702 | |-- size | |
703 | |-- subsystem -> ../../../../../../bus/nd | |
704 | |-- type | |
705 | |-- uevent | |
706 | `-- uuid | |
707 | /sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0 | |
708 | |-- block | |
709 | | `-- pmem0 | |
710 | |-- devtype | |
711 | |-- driver -> ../../../../../../bus/nd/drivers/pmem | |
712 | |-- force_raw | |
713 | |-- modalias | |
714 | |-- numa_node | |
715 | |-- resource | |
716 | |-- size | |
717 | |-- subsystem -> ../../../../../../bus/nd | |
718 | |-- type | |
719 | `-- uevent | |
720 | ||
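The distinctions called out in the sample layout can be turned into a small decision rule. The following is a minimal sketch, assuming only the attribute names visible in the listing above; the enum, the helper names, and the idea of classifying by attribute presence are illustrative assumptions, not part of libnvdimm or libndctl. Other kernel versions may expose a different attribute set, so the 'type' and 'devtype' attributes remain the more reliable source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical labels for the three layouts shown in the sample listing. */
enum ns_kind {
    NS_PMEM_LABELED,   /* PMEM backed by DIMM label info: has 'uuid' */
    NS_BLK,            /* BLK namespace: has 'sector_size' */
    NS_PMEM_ANONYMOUS  /* PMEM without LABEL support: no 'uuid' */
};

/* Return 1 if 'name' appears in the attribute list, 0 otherwise. */
static int has_attr(const char *const *attrs, size_t count, const char *name)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(attrs[i], name) == 0)
            return 1;
    return 0;
}

/*
 * Classify a namespace from the names of its sysfs attributes, using
 * only the differences visible in the sample listing (an assumption).
 */
static enum ns_kind classify_namespace(const char *const *attrs, size_t count)
{
    if (has_attr(attrs, count, "sector_size"))
        return NS_BLK;
    if (has_attr(attrs, count, "uuid"))
        return NS_PMEM_LABELED;
    return NS_PMEM_ANONYMOUS;
}
```

Passing the attribute names of namespace6.0 from the listing, which include neither 'sector_size' nor 'uuid', yields NS_PMEM_ANONYMOUS.
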
721 | LIBNDCTL: namespace enumeration example | |
b0a4aa95 | 722 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
bc30196f DW |
723 | Namespaces are indexed relative to their parent region, as in the example below. |
724 | These indexes are mostly static from boot to boot, but the subsystem makes | |
725 | no guarantees in this regard. For a static namespace identifier use its | |
726 | 'uuid' attribute. | |
727 | ||
b0a4aa95 | 728 | :: |
bc30196f | 729 | |
b0a4aa95 MCC |
730 | static struct ndctl_namespace |
731 | *get_namespace_by_id(struct ndctl_region *region, unsigned int id) | |
732 | { | |
733 |     struct ndctl_namespace *ndns; | |
bc30196f | 734 | |
b0a4aa95 MCC |
735 |     ndctl_namespace_foreach(region, ndns) |
736 |         if (ndctl_namespace_get_id(ndns) == id) | |
737 |             return ndns; | |
738 | ||
739 |     return NULL; | |
740 | } | |
bc30196f DW |
741 | |
742 | LIBNDCTL: namespace creation example | |
b0a4aa95 MCC |
743 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
744 | ||
bc30196f DW |
745 | Idle namespaces are automatically created by the kernel if a given |
746 | region has enough available capacity to create a new namespace. | |
747 | Namespace instantiation involves finding an idle namespace and | |
748 | configuring it. For the most part, namespace attributes can be set | |
749 | in any order; the only constraint is that 'uuid' must be set | |
750 | before 'size'. This enables the kernel to track DPA allocations | |
b0a4aa95 | 751 | internally with a static identifier:: |
bc30196f | 752 | |
b0a4aa95 MCC |
753 | static int configure_namespace(struct ndctl_region *region, |
754 |         struct ndctl_namespace *ndns, | |
755 |         struct namespace_parameters *parameters) | |
756 | { | |
757 |     char devname[50]; | |
bc30196f | 758 | |
b0a4aa95 MCC |
759 |     snprintf(devname, sizeof(devname), "namespace%d.%d", |
760 |          ndctl_region_get_id(region), parameters->id); | |
bc30196f | 761 | |
b0a4aa95 MCC |
762 |     ndctl_namespace_set_alt_name(ndns, devname); |
763 |     /* 'uuid' must be set prior to setting size! */ | |
764 |     ndctl_namespace_set_uuid(ndns, parameters->uuid); | |
765 |     ndctl_namespace_set_size(ndns, parameters->size); | |
766 |     /* unlike pmem namespaces, blk namespaces have a sector size */ | |
767 |     if (parameters->lbasize) | |
768 |         ndctl_namespace_set_sector_size(ndns, parameters->lbasize); | |
769 |     return ndctl_namespace_enable(ndns); | |
770 | } | |
bc30196f DW |
771 | |
772 | ||
773 | Why the Term "namespace"? | |
b0a4aa95 | 774 | ^^^^^^^^^^^^^^^^^^^^^^^^^ |
bc30196f | 775 | |
8de5dff8 | 776 | 1. Why not "volume" for instance? "volume" ran the risk of confusing |
b0a4aa95 | 777 | ND (libnvdimm subsystem) with a volume manager like device-mapper. |
bc30196f DW |
778 | |
779 | 2. The term originated to describe the sub-devices that can be created | |
b0a4aa95 MCC |
780 | within an NVME controller (see the nvme specification: |
781 | http://www.nvmexpress.org/specifications/), and NFIT namespaces are | |
782 | meant to parallel the capabilities and configurability of | |
783 | NVME-namespaces. | |
bc30196f DW |
784 | |
785 | ||
786 | LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" | |
b0a4aa95 | 787 | ------------------------------------------------- |
bc30196f DW |
788 | |
789 | A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked | |
790 | block device driver that fronts either the whole block device or a | |
791 | partition of a block device emitted by either a PMEM or BLK NAMESPACE. | |
792 | ||
793 | LIBNVDIMM: btt layout | |
b0a4aa95 MCC |
794 | ^^^^^^^^^^^^^^^^^^^^^ |
795 | ||
bc30196f DW |
796 | Every region will start out with at least one BTT device which is the |
797 | seed device. To activate it, set the "namespace", "uuid", and | |
798 | "sector_size" attributes and then bind the device to the nd_pmem or | |
b0a4aa95 | 799 | nd_blk driver depending on the region type:: |
bc30196f DW |
800 | |
801 | /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/ | |
802 | |-- namespace | |
803 | |-- delete | |
804 | |-- devtype | |
805 | |-- modalias | |
806 | |-- numa_node | |
807 | |-- sector_size | |
808 | |-- subsystem -> ../../../../../bus/nd | |
809 | |-- uevent | |
810 | `-- uuid | |
811 | ||
812 | LIBNDCTL: btt creation example | |
b0a4aa95 MCC |
813 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
814 | ||
bc30196f DW |
815 | Similar to namespaces, an idle BTT device is automatically created per |
816 | region. Each time this "seed" btt device is configured and enabled, a new | |
817 | seed is created. Creating a BTT configuration involves two steps: | |
b0a4aa95 | 818 | finding an idle BTT and assigning it to consume a PMEM or BLK namespace:: |
bc30196f DW |
819 | |
820 | static struct ndctl_btt *get_idle_btt(struct ndctl_region *region) | |
821 | { | |
822 |     struct ndctl_btt *btt; | |
823 | ||
824 |     ndctl_btt_foreach(region, btt) | |
825 |         if (!ndctl_btt_is_enabled(btt) | |
826 |             && !ndctl_btt_is_configured(btt)) | |
827 |             return btt; | |
828 | ||
829 |     return NULL; | |
830 | } | |
831 | ||
832 | static int configure_btt(struct ndctl_region *region, | |
833 |         struct btt_parameters *parameters) | |
834 | { | |
835 |     struct ndctl_btt *btt = get_idle_btt(region); | |
836 | ||
837 |     ndctl_btt_set_uuid(btt, parameters->uuid); | |
838 |     ndctl_btt_set_sector_size(btt, parameters->sector_size); | |
839 |     ndctl_btt_set_namespace(btt, parameters->ndns); | |
840 |     /* turn off raw mode device */ | |
841 |     ndctl_namespace_disable(parameters->ndns); | |
842 |     /* turn on btt access */ | |
843 |     return ndctl_btt_enable(btt); | |
844 | } | |
845 | ||
846 | Once instantiated, a new inactive btt seed device will appear underneath | |
847 | the region. | |
848 | ||
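The seed behavior described in this section can be modeled in a few lines. The sketch below is a toy in-memory model under stated assumptions: the types, the fixed device limit, and the function names are hypothetical, and the model starts each region with a single seed even though a real region may have more than one BTT device.

```c
#include <stddef.h>

#define MOCK_MAX_BTT 8 /* arbitrary limit for this model only */

/* Hypothetical stand-ins for a region and its BTT devices. */
struct mock_btt {
    int configured; /* namespace, uuid and sector_size have been set */
    int enabled;    /* device has been bound to a driver */
};

struct mock_region {
    struct mock_btt btt[MOCK_MAX_BTT];
    size_t nr_btt;  /* number of BTT devices currently present */
};

/* This model starts a region with one idle seed device. */
static void mock_region_init(struct mock_region *r)
{
    r->nr_btt = 1;
    r->btt[0].configured = 0;
    r->btt[0].enabled = 0;
}

/* Return the first device that is neither enabled nor configured. */
static struct mock_btt *mock_get_idle_btt(struct mock_region *r)
{
    for (size_t i = 0; i < r->nr_btt; i++)
        if (!r->btt[i].enabled && !r->btt[i].configured)
            return &r->btt[i];
    return NULL;
}

/* Configure and enable the seed; a fresh idle seed then appears. */
static int mock_configure_btt(struct mock_region *r)
{
    struct mock_btt *btt = mock_get_idle_btt(r);

    if (!btt || r->nr_btt == MOCK_MAX_BTT)
        return -1;
    btt->configured = 1;
    btt->enabled = 1;
    r->btt[r->nr_btt].configured = 0;
    r->btt[r->nr_btt].enabled = 0;
    r->nr_btt++;
    return 0;
}
```

After one call to mock_configure_btt(), the region holds two devices: the original one, now active, and a new idle seed that the next lookup returns.
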
849 | Once a "namespace" is removed from a BTT, that instance of the BTT device | |
850 | will be deleted or otherwise reset to default values. This deletion is | |
851 | only at the device model level. In order to destroy a BTT the "info | |
852 | block" needs to be destroyed. Note that to destroy a BTT the media | |
853 | needs to be written in raw mode. By default, the kernel will autodetect | |
854 | the presence of a BTT and disable raw mode. This autodetect behavior | |
855 | can be suppressed by enabling raw mode for the namespace via the | |
8de5dff8 | 856 | ndctl_namespace_set_raw_mode() API. |
bc30196f DW |
857 | |
858 | ||
859 | Summary LIBNDCTL Diagram | |
860 | ------------------------ | |
861 | ||
8de5dff8 | 862 | For the given example above, here is the view of the objects as seen by the |
b0a4aa95 MCC |
863 | LIBNDCTL API:: |
864 | ||
865 |             +---+ | |
866 |             |CTX|    +---------+   +--------------+  +---------------+ | |
867 |             +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | | |
868 |               |    | +---------+   +--------------+  +---------------+ | |
869 | +-------+     |    | +---------+   +--------------+  +---------------+ | |
870 | | DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" | | |
871 | +-------+ |   |    | +---------+   +--------------+  +---------------+ | |
872 | | DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+ | |
873 | +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" | | |
874 | | DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+ | |
875 | +-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 | | |
876 | | DIMM3 <-+        |               +--------------+  +----------------------+ | |
877 | +-------+          | +---------+   +--------------+  +---------------+ | |
878 |                    +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" | | |
879 |                    | +---------+ | +--------------+  +----------------------+ | |
880 |                    |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 | | |
881 |                    |               +--------------+  +----------------------+ | |
882 |                    | +---------+   +--------------+  +---------------+ | |
883 |                    +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" | | |
884 |                    | +---------+   +--------------+  +---------------+ | |
885 |                    | +---------+   +--------------+  +----------------------+ | |
886 |                    +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 | | |
887 |                      +---------+   +--------------+  +---------------+------+ |