docs: nvdimm: add it to the driver-api book

author Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

Tue, 18 Jun 2019 19:32:31 +0000 (16:32 -0300)

committer Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

Mon, 15 Jul 2019 12:20:27 +0000 (09:20 -0300)
diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst

index d665cd9ab95f3756567cfc24ad60bb8994c58ab1..410dd71107726ff8b7d04a5c29b451bf8a06dcd9 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -44,6 +44,7 @@ available subsections can be seen below.
     mtdnand
     miscellaneous
     mei/index
+   nvdimm/index
     w1
     rapidio/index
     s390-drivers
diff --git a/Documentation/driver-api/nvdimm/btt.rst b/Documentation/driver-api/nvdimm/btt.rst

new file mode 100644

index 0000000..2d8269f
--- /dev/null
+++ b/Documentation/driver-api/nvdimm/btt.rst
@@ -0,0 +1,285 @@
+=============================
+BTT - Block Translation Table
+=============================
+
+
+1. Introduction
+===============
+
+Persistent memory based storage is able to perform IO at byte (or more
+accurately, cache line) granularity. However, we often want to expose such
+storage as traditional block devices. The block drivers for persistent memory
+will do exactly this. However, they do not provide any atomicity guarantees.
+Traditional SSDs typically provide protection against torn sectors in hardware,
+using stored energy in capacitors to complete in-flight block writes, or perhaps
+in firmware. We don't have this luxury with persistent memory - if a write is in
+progress, and we experience a power failure, the block will contain a mix of old
+and new data. Applications may not be prepared to handle such a scenario.
+
+The Block Translation Table (BTT) provides atomic sector update semantics for
+persistent memory devices, so that applications that rely on sector writes not
+being torn can continue to do so. The BTT manifests itself as a stacked block
+device, and reserves a portion of the underlying storage for its metadata. At
+its heart is an indirection table that re-maps all the blocks on the
+volume. It can be thought of as an extremely simple file system that only
+provides atomic sector updates.
+
+
+2. Static Layout
+================
+
+The underlying storage on which a BTT can be laid out is not limited in any way.
+The BTT, however, splits the available space into chunks of up to 512 GiB,
+called "Arenas".
+
+Each arena follows the same layout for its metadata, and all references in an
+arena are internal to it (with the exception of one field that points to the
+next arena). The following depicts the "On-disk" metadata layout::
+
+
+    Backing Store     +------->  Arena
+  +---------------+   |   +------------------+
+  |               |   |   | Arena info block |
+  |    Arena 0    +---+   |       4K         |
+  |     512G      |       +------------------+
+  |               |       |                  |
+  +---------------+       |                  |
+  |               |       |                  |
+  |    Arena 1    |       |   Data Blocks    |
+  |     512G      |       |                  |
+  |               |       |                  |
+  +---------------+       |                  |
+  |       .       |       |                  |
+  |       .       |       |                  |
+  |       .       |       |                  |
+  |               |       |                  |
+  |               |       |                  |
+  +---------------+       +------------------+
+                          |                  |
+                          |     BTT Map      |
+                          |                  |
+                          |                  |
+                          +------------------+
+                          |                  |
+                          |     BTT Flog     |
+                          |                  |
+                          +------------------+
+                          | Info block copy  |
+                          |       4K         |
+                          +------------------+
+
+
+3. Theory of Operation
+======================
+
+
+a. The BTT Map
+--------------
+
+The map is a simple lookup/indirection table that maps an LBA to an internal
+block. Each map entry is 32 bits. The two most significant bits are special
+flags, and the remaining form the internal block number.
+
+======== =============================================================
+Bit      Description
+======== =============================================================
+31 - 30         Error and Zero flags - Used in the following way:
+
+          == ==  ====================================================
+          31 30  Description
+          == ==  ====================================================
+          0  0   Initial state. Reads return zeroes; Premap = Postmap
+          0  1   Zero state: Reads return zeroes
+          1  0   Error state: Reads fail; Writes clear 'E' bit
+          1  1   Normal Block – has valid postmap
+          == ==  ====================================================
+
+29 - 0  Mappings to internal 'postmap' blocks
+======== =============================================================
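The flag encoding in the table above can be sketched in C as follows. This is only an illustration of the bit layout, not the kernel's implementation; the macro, enum, and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the bit positions follow the map entry table. */
#define MAP_ERR_BIT   (UINT32_C(1) << 31)   /* bit 31: 'E' flag */
#define MAP_ZERO_BIT  (UINT32_C(1) << 30)   /* bit 30: 'Z' flag */
#define MAP_ABA_MASK  UINT32_C(0x3fffffff)  /* bits 29 - 0: postmap block */

enum map_state { MAP_INITIAL, MAP_ZERO, MAP_ERROR, MAP_NORMAL };

/* Classify a raw 32-bit map entry by its two flag bits. */
static enum map_state map_entry_state(uint32_t entry)
{
    int e = (entry & MAP_ERR_BIT) != 0;
    int z = (entry & MAP_ZERO_BIT) != 0;

    if (e && z)
        return MAP_NORMAL;   /* 1 1: valid postmap */
    if (e)
        return MAP_ERROR;    /* 1 0: reads fail */
    if (z)
        return MAP_ZERO;     /* 0 1: reads return zeroes */
    return MAP_INITIAL;      /* 0 0: premap == postmap */
}

/* Return the postmap ABA for a premap ABA, given its raw map entry. */
static uint32_t map_entry_postmap(uint32_t entry, uint32_t premap)
{
    assert(premap <= MAP_ABA_MASK);
    if (map_entry_state(entry) == MAP_INITIAL)
        return premap;       /* identity mapping in the initial state */
    return entry & MAP_ABA_MASK;
}
```

For instance, a raw entry of 0xC0000040 has both flag bits set and decodes to a normal block whose postmap ABA is 64.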
+
+
+Some of the terminology that will be subsequently used:
+
+============   ================================================================
+External LBA   LBA as made visible to upper layers.
+ABA            Arena Block Address - Block offset/number within an arena
+Premap ABA     The block offset into an arena, which was decided upon by range
+               checking the External LBA
+Postmap ABA    The block number in the "Data Blocks" area obtained after
+               indirection from the map
+nfree          The number of free blocks that are maintained at any given time.
+               This is the number of concurrent writes that can happen to the
+               arena.
+============   ================================================================
+
+
+For example, after adding a BTT, we surface a disk of 1024G. We get a read for
+the external LBA at 768G. This falls into the second arena, and of the 512G
+worth of blocks that this arena contributes, this block is at 256G. Thus, the
+premap ABA is 256G. We now refer to the map, and find out the mapping for block
+'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
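The range check in the example above can be sketched as a walk over the arenas. This is a simplified illustration: the struct and function names are hypothetical, and each arena's externally visible block count is assumed to be already known.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: number of external blocks that one arena contributes. */
struct arena_span {
    uint64_t external_nlba;
};

/*
 * Translate an external LBA into (arena index, premap ABA) by range
 * checking against each arena in order. Returns 0 on success, or -1 if
 * the LBA lies beyond the last arena.
 */
static int lba_to_arena(const struct arena_span *arenas, size_t narenas,
                        uint64_t lba, size_t *arena_idx, uint64_t *premap)
{
    uint64_t start = 0;

    assert(arenas && arena_idx && premap);
    for (size_t i = 0; i < narenas; i++) {
        if (lba < start + arenas[i].external_nlba) {
            *arena_idx = i;
            *premap = lba - start;   /* offset within this arena */
            return 0;
        }
        start += arenas[i].external_nlba;
    }
    return -1;
}
```

With two arenas of 512 units each, an external LBA of 768 resolves to the second arena (index 1) with a premap ABA of 256, matching the example in the text.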
+
+
+b. The BTT Flog
+---------------
+
+The BTT provides sector atomicity by making every write an "allocating write",
+i.e. every write goes to a "free" block. A running list of free blocks is
+maintained in the form of the BTT flog. 'Flog' is a combination of the words
+"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
+
+========  =====================================================================
+lba       The premap ABA that is being written to
+old_map   The old postmap ABA - after 'this' write completes, this will be a
+          free block.
+new_map   The new postmap ABA. The map will be updated to reflect this
+          lba->postmap_aba mapping, but we log it here in case we have to
+          recover.
+seq       Sequence number to mark which of the 2 sections of this flog entry is
+          valid/newest. It cycles between 01->10->11->01 (binary) under normal
+          operation, with 00 indicating an uninitialized state.
+lba'      alternate lba entry
+old_map'  alternate old postmap entry
+new_map'  alternate new postmap entry
+seq'      alternate sequence number.
+========  =====================================================================
+
+Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
+padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
+done such that for any entry being written, it:
+
+a. overwrites the 'old' section in the entry based on sequence numbers
+b. writes the 'new' section such that the sequence number is written last.
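The sequence number cycle (01->10->11->01, with 00 meaning uninitialized) is what lets a reader decide which of the two sections is newest. A minimal sketch of that comparison follows; the function names are hypothetical and this is not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Advance a valid sequence number along the cycle 1 -> 2 -> 3 -> 1. */
static uint32_t flog_next_seq(uint32_t seq)
{
    assert(seq >= 1 && seq <= 3);
    return seq == 3 ? 1 : seq + 1;
}

/*
 * Given the sequence numbers of the two sections of one flog entry,
 * return the index (0 or 1) of the newest section, or -1 if the pair is
 * inconsistent (both uninitialized, equal, or out of range).
 */
static int flog_newer_section(uint32_t seq0, uint32_t seq1)
{
    if (seq0 > 3 || seq1 > 3 || seq0 == seq1)
        return -1;
    if (seq0 == 0)
        return 1;            /* only section 1 was ever written */
    if (seq1 == 0)
        return 0;            /* only section 0 was ever written */
    /* Both valid and distinct: the newer one is the successor. */
    return flog_next_seq(seq0) == seq1 ? 1 : 0;
}
```

Because the cycle has three valid values, any two distinct valid sequence numbers are always adjacent in the cycle, so the successor test is unambiguous.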
+
+
+c. The concept of lanes
+-----------------------
+
+While 'nfree' describes the number of IOs an arena can process
+concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
+process::
+
+       nlanes = min(nfree, num_cpus)
+
+A lane number is obtained at the start of any IO, and is used for indexing into
+all the on-disk and in-memory data structures for the duration of the IO. If
+there are more CPUs than the max number of available lanes, then lanes are
+protected by spinlocks.
+
+
+d. In-memory data structure: Read Tracking Table (RTT)
+------------------------------------------------------
+
+Consider a case where we have two threads, one doing reads and the other,
+writes. We can hit a condition where the writer thread grabs a free block to do
+a new IO, but the (slow) reader thread is still reading from it. In other words,
+the reader consulted a map entry, and started reading the corresponding block. A
+writer started writing to the same external LBA, and finished the write updating
+the map for that external LBA to point to its new postmap ABA. At this point the
+internal, postmap block that the reader is (still) reading has been inserted
+into the list of free blocks. If another write comes in for the same LBA, it can
+grab this free block, and start writing to it, causing the reader to read
+incorrect data. To prevent this, we introduce the RTT.
+
+The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
+into rtt[lane_number], the postmap ABA it is reading, and clears it after the
+read is complete. Every writer thread, after grabbing a free block, checks the
+RTT for its presence. If the postmap free block is in the RTT, it waits till the
+reader clears the RTT entry, and only then starts writing to it.
+
+
+e. In-memory data structure: map locks
+--------------------------------------
+
+Consider a case where two writer threads are writing to the same LBA. There can
+be a race in the following sequence of steps::
+
+       free[lane] = map[premap_aba]
+       map[premap_aba] = postmap_aba
+
+Both threads can update their respective free[lane] with the same old, freed
+postmap_aba. This has made the layout inconsistent by losing a free entry, and
+at the same time, duplicating another free entry for two lanes.
+
+To solve this, we could have a single map lock (per arena) that has to be taken
+before performing the above sequence, but we feel that could be too contentious.
+Instead we use an array of (nfree) map_locks that is indexed by
+(premap_aba modulo nfree).
+
+
+f. Reconstruction from the Flog
+-------------------------------
+
+On startup, we analyze the BTT flog to create our list of free blocks. We walk
+through all the entries, and for each lane, of the set of two possible
+'sections', we always look at the most recent one only (based on the sequence
+number). The reconstruction rules/steps are simple:
+
+- Read map[log_entry.lba].
+- If log_entry.new matches the map entry, then log_entry.old is free.
+- If log_entry.new does not match the map entry, then log_entry.new is free.
+  (This case can only be caused by power-fails/unsafe shutdowns)
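The reconstruction rules above reduce to a single comparison per lane. The sketch below assumes the newest section of the flog entry has already been selected and that the map array holds decoded postmap ABAs (flag bits stripped); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory copy of the newest section of one flog entry. */
struct flog_section {
    uint32_t lba;       /* premap ABA that was being written */
    uint32_t old_map;   /* postmap ABA before the write */
    uint32_t new_map;   /* postmap ABA after the write */
    uint32_t seq;       /* sequence number of this section */
};

/*
 * Return the free block for one lane. 'map' is assumed to hold decoded
 * postmap ABAs indexed by premap ABA.
 */
static uint32_t flog_free_block(const uint32_t *map,
                                const struct flog_section *newest)
{
    assert(map && newest);
    if (map[newest->lba] == newest->new_map)
        return newest->old_map;   /* map update landed: old block is free */
    return newest->new_map;       /* interrupted write: new block is free */
}
```

In the second case the map still points at the old block, so the data written to the new block is discarded and the sector keeps its previous contents.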
+
+
+g. Summarizing - Read and Write flows
+-------------------------------------
+
+Read:
+
+1.  Convert external LBA to arena number + pre-map ABA
+2.  Get a lane (and take lane_lock)
+3.  Read map to get the entry for this pre-map ABA
+4.  Enter post-map ABA into RTT[lane]
+5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
+6.  If ERROR flag set in map, end IO with EIO (go to step 8)
+7.  Read data from this block
+8.  Remove post-map ABA entry from RTT[lane]
+9.  Release lane (and lane_lock)
+
+Write:
+
+1.  Convert external LBA to Arena number + pre-map ABA
+2.  Get a lane (and take lane_lock)
+3.  Use lane to index into in-memory free list and obtain a new block, next flog
+    index, next sequence number
+4.  Scan the RTT to check if free block is present, and spin/wait if it is.
+5.  Write data to this free block
+6.  Read map to get the existing post-map ABA entry for this pre-map ABA
+7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
+8.  Write new post-map ABA into map.
+9.  Write old post-map entry into the free list
+10. Calculate next sequence number and write into the free list entry
+11. Release lane (and lane_lock)
+
+
+4. Error Handling
+=================
+
+An arena would be in an error state if any of the metadata is corrupted
+irrecoverably, either due to a bug or a media error. The following conditions
+indicate an error:
+
+- Info block checksum does not match (and recovering from the copy also fails)
+- The internal available blocks are not all uniquely and entirely addressed by
+  the sum of mapped blocks and free blocks (from the BTT flog).
+- Rebuilding free list from the flog reveals missing/duplicate/impossible
+  entries
+- A map entry is out of bounds
+
+If any of these error conditions are encountered, the arena is put into a
+read-only state using a flag in the info block.
+
+
+5. Usage
+========
+
+The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
+(pmem, or blk mode). The easiest way to set up such a namespace is using the
+'ndctl' utility [1].
+
+For example, the ndctl command line to set up a BTT with a 4k sector size is::
+
+    ndctl create-namespace -f -e namespace0.0 -m sector -l 4k
+
+See ndctl create-namespace --help for more options.
+
+[1]: https://github.com/pmem/ndctl
diff --git a/Documentation/driver-api/nvdimm/index.rst b/Documentation/driver-api/nvdimm/index.rst

new file mode 100644

index 0000000..19dc8ee
--- /dev/null
+++ b/Documentation/driver-api/nvdimm/index.rst
@@ -0,0 +1,10 @@
+===================================
+Non-Volatile Memory Device (NVDIMM)
+===================================
+
+.. toctree::
+   :maxdepth: 1
+
+   nvdimm
+   btt
+   security
diff --git a/Documentation/driver-api/nvdimm/nvdimm.rst b/Documentation/driver-api/nvdimm/nvdimm.rst

new file mode 100644

index 0000000..08f855c
--- /dev/null
+++ b/Documentation/driver-api/nvdimm/nvdimm.rst
@@ -0,0 +1,887 @@
+===============================
+LIBNVDIMM: Non-Volatile Devices
+===============================
+
+libnvdimm - kernel / libndctl - userspace helper library
+
+linux-nvdimm@lists.01.org
+
+Version 13
+
+.. contents:
+
+       Glossary
+       Overview
+           Supporting Documents
+           Git Trees
+       LIBNVDIMM PMEM and BLK
+       Why BLK?
+           PMEM vs BLK
+               BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+       Example NVDIMM Platform
+       LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
+           LIBNDCTL: Context
+               libndctl: instantiate a new library context example
+           LIBNVDIMM/LIBNDCTL: Bus
+               libnvdimm: control class device in /sys/class
+               libnvdimm: bus
+               libndctl: bus enumeration example
+           LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
+               libnvdimm: DIMM (NMEM)
+               libndctl: DIMM enumeration example
+           LIBNVDIMM/LIBNDCTL: Region
+               libnvdimm: region
+               libndctl: region enumeration example
+               Why Not Encode the Region Type into the Region Name?
+               How Do I Determine the Major Type of a Region?
+           LIBNVDIMM/LIBNDCTL: Namespace
+               libnvdimm: namespace
+               libndctl: namespace enumeration example
+               libndctl: namespace creation example
+               Why the Term "namespace"?
+           LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
+               libnvdimm: btt layout
+               libndctl: btt creation example
+       Summary LIBNDCTL Diagram
+
+
+Glossary
+========
+
+PMEM:
+  A system-physical-address range where writes are persistent.  A
+  block device composed of PMEM is capable of DAX.  A PMEM address range
+  may span an interleave of several DIMMs.
+
+BLK:
+  A set of one or more programmable memory mapped apertures provided
+  by a DIMM to access its media.  This indirection precludes the
+  performance benefit of interleaving, but enables DIMM-bounded failure
+  modes.
+
+DPA:
+  DIMM Physical Address, is a DIMM-relative offset.  With one DIMM in
+  the system there would be a 1:1 system-physical-address:DPA association.
+  Once more DIMMs are added a memory controller interleave must be
+  decoded to determine the DPA associated with a given
+  system-physical-address.  BLK capacity always has a 1:1 relationship
+  with a single-DIMM's DPA range.
+
+DAX:
+  File system extensions to bypass the page cache and block layer to
+  mmap persistent memory, from a PMEM block device, directly into a
+  process address space.
+
+DSM:
+  Device Specific Method: ACPI method to control a specific
+  device - in this case the firmware.
+
+DCR:
+  NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
+  It defines a vendor-id, device-id, and interface format for a given DIMM.
+
+BTT:
+  Block Translation Table: Persistent memory is byte addressable.
+  Existing software may have an expectation that the power-fail-atomicity
+  of writes is at least one sector, 512 bytes.  The BTT is an indirection
+  table with atomic update semantics to front a PMEM/BLK block device
+  driver and present arbitrary atomic sector sizes.
+
+LABEL:
+  Metadata stored on a DIMM device that partitions and identifies
+  (persistently names) storage between PMEM and BLK.  It also partitions
+  BLK storage to host BTTs with different parameters per BLK-partition.
+  Note that traditional partition tables, GPT/MBR, are layered on top of a
+  BLK or PMEM device.
+
+
+Overview
+========
+
+The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
+PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
+and BLK mode access.  These three modes of operation are described by
+the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6.  While the LIBNVDIMM
+implementation is generic and supports pre-NFIT platforms, it was guided
+by the superset of capabilities needed to support this ACPI 6 definition
+for NVDIMM resources.  The bulk of the kernel implementation is in place
+to handle the case where DPA accessible via PMEM is aliased with DPA
+accessible via BLK.  When that occurs a LABEL is needed to reserve DPA
+for exclusive access via one mode at a time.
+
+Supporting Documents
+--------------------
+
+ACPI 6:
+       http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
+NVDIMM Namespace:
+       http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+DSM Interface Example:
+       http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+Driver Writer's Guide:
+       http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
+
+Git Trees
+---------
+
+LIBNVDIMM:
+       https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
+LIBNDCTL:
+       https://github.com/pmem/ndctl.git
+PMEM:
+       https://github.com/01org/prd
+
+
+LIBNVDIMM PMEM and BLK
+======================
+
+Prior to the arrival of the NFIT, non-volatile memory was described to a
+system in various ad-hoc ways.  Usually only the bare minimum was
+provided, namely, a single system-physical-address range where writes
+are expected to be durable after a system power loss.  Now, the NFIT
+specification standardizes not only the description of PMEM, but also
+BLK and platform message-passing entry points for control and
+configuration.
+
+For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
+device driver:
+
+    1. PMEM (nd_pmem.ko): Drives a system-physical-address range.  This
+       range is contiguous in system memory and may be interleaved (hardware
+       memory controller striped) across multiple DIMMs.  When interleaved the
+       platform may optionally provide details of which DIMMs are participating
+       in the interleave.
+
+       Note that while LIBNVDIMM describes system-physical-address ranges that may
+       alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
+       alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
+       distinction.  The different device-types are an implementation detail
+       that userspace can exploit to implement policies like "only interface
+       with address ranges from certain DIMMs".  It is worth noting that when
+       aliasing is present and a DIMM lacks a label, then no block device can
+       be created by default as userspace needs to do at least one allocation
+       of DPA to the PMEM range.  In contrast ND_NAMESPACE_IO ranges, once
+       registered, can be immediately attached to nd_pmem.
+
+    2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
+       defined apertures.  A set of apertures will access just one DIMM.
+       Multiple windows (apertures) allow multiple concurrent accesses, much like
+       tagged-command-queuing, and would likely be used by different threads or
+       different CPUs.
+
+       The NFIT specification defines a standard format for a BLK-aperture, but
+       the spec also allows for vendor specific layouts, and non-NFIT BLK
+       implementations may have other designs for BLK I/O.  For this reason
+       "nd_blk" calls back into platform-specific code to perform the I/O.
+
+       One such implementation is defined in the "Driver Writer's Guide" and "DSM
+       Interface Example".
+
+
+Why BLK?
+========
+
+While PMEM provides direct byte-addressable CPU-load/store access to
+NVDIMM storage, it does not provide the best system RAS (reliability,
+availability, and serviceability) model.  An access to a corrupted
+system-physical-address causes a CPU exception while an access
+to a corrupted address through a BLK-aperture causes that block window
+to raise an error status in a register.  The latter is more aligned with
+the standard error model that host-bus-adapter attached disks present.
+
+Also, if an administrator ever wants to replace memory, it is easier to
+service a system at DIMM module boundaries.  Compare this to PMEM where
+data could be interleaved in an opaque hardware specific manner across
+several DIMMs.
+
+PMEM vs BLK
+-----------
+
+BLK-apertures solve these RAS problems, but their presence is also the
+major contributing factor to the complexity of the ND subsystem.  They
+complicate the implementation because PMEM and BLK alias in DPA space.
+Any given DIMM's DPA-range may contribute to one or more
+system-physical-address sets of interleaved DIMMs, *and* may also be
+accessed in its entirety through its BLK-aperture.  Accessing a DPA
+through a system-physical-address while simultaneously accessing the
+same DPA through a BLK-aperture has undefined results.  For this reason,
+DIMMs with this dual interface configuration include a DSM function to
+store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space
+into exclusive system-physical-address and BLK-aperture accessible
+regions.  For simplicity a DIMM is allowed a PMEM "region" per each
+interleave set in which it is a member.  The remaining DPA space can be
+carved into an arbitrary number of BLK devices with discontiguous
+extents.
+
+BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+One of the few reasons to allow multiple BLK namespaces per REGION is so
+that each BLK-namespace can be configured with a BTT with unique atomic
+sector sizes.  While a PMEM device can host a BTT, the LABEL specification
+does not provide for a sector size to be specified for a PMEM namespace.
+
+This is due to the expectation that the primary usage model for PMEM is
+via DAX, and the BTT is incompatible with DAX.  However, for the cases
+where an application or filesystem still needs atomic sector update
+guarantees it can register a BTT on a PMEM device or partition.  See
+LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".
+
+
+Example NVDIMM Platform
+=======================
+
+For the remainder of this document the following diagram will be
+referenced for any example sysfs layouts::
+
+
+                               (a)               (b)           DIMM   BLK-REGION
+            +-------------------+--------+--------+--------+
+  +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
+  | imc0 +--+- - - region0- - - +--------+        +--------+
+  +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
+     |      +-------------------+--------v        v--------+
+  +--+---+                               |                 |
+  | cpu0 |                                     region1
+  +--+---+                               |                 |
+     |      +----------------------------^        ^--------+
+  +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
+  | imc1 +--+----------------------------|        +--------+
+  +------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
+            +----------------------------+--------+--------+
+
+In this platform we have four DIMMs and two memory controllers in one
+socket.  Each unique interface (BLK or PMEM) to DPA space is identified
+by a region device with a dynamically assigned id (REGION0 - REGION5).
+
+    1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
+       single PMEM namespace is created in the REGION0-SPA-range that spans most
+       of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
+       interleaved system-physical-address range is reclaimed as BLK-aperture
+       accessed space starting at DPA-offset (a) into each DIMM.  In that
+       reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
+       REGION3 where "blk2.0" and "blk3.0" are just human readable names that
+       could be set to any user-desired name in the LABEL.
+
+    2. In the last portion of DIMM0 and DIMM1 we have an interleaved
+       system-physical-address range, REGION1, that spans those two DIMMs as
+       well as DIMM2 and DIMM3.  Some of REGION1 is allocated to a PMEM namespace
+       named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for
+       each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
+       "blk5.0".
+
+    3. The portion of DIMM2 and DIMM3 that do not participate in the REGION1
+       interleaved system-physical-address range (i.e. the DPA address past
+       offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
+       Note that this example shows that BLK-aperture namespaces don't need to
+       be contiguous in DPA-space.
+
+    This bus is provided by the kernel under the device
+    /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
+    the nfit_test.ko module is loaded.  This not only tests LIBNVDIMM but the
+    acpi_nfit.ko driver as well.
+
+
+LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
+========================================================
+
+What follows is a description of the LIBNVDIMM sysfs layout and a
+corresponding object hierarchy diagram as viewed through the LIBNDCTL
+API.  The example sysfs paths and diagrams are relative to the Example
+NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
+test.
+
+LIBNDCTL: Context
+-----------------
+
+Every API call in the LIBNDCTL library requires a context that holds the
+logging parameters and other library instance state.  The library is
+based on the libabc template:
+
+       https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git
+
+LIBNDCTL: instantiate a new library context example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+       struct ndctl_ctx *ctx;
+
+       if (ndctl_new(&ctx) == 0)
+               return ctx;
+       else
+               return NULL;
+
+LIBNVDIMM/LIBNDCTL: Bus
+-----------------------
+
+A bus has a 1:1 relationship with an NFIT.  The current expectation for
+ACPI based systems is that there is only ever one platform-global NFIT.
+That said, it is trivial to register multiple NFITs, and the specification
+does not preclude it.  The infrastructure supports multiple busses and
+we use this capability to test multiple NFIT configurations in the unit
+test.
+
+LIBNVDIMM: control class device in /sys/class
+---------------------------------------------
+
+This character device accepts DSM messages to be passed to a DIMM
+identified by its NFIT handle::
+
+       /sys/class/nd/ndctl0
+       |-- dev
+       |-- device -> ../../../ndbus0
+       |-- subsystem -> ../../../../../../../class/nd
+
+
+
+LIBNVDIMM: bus
+--------------
+
+::
+
+       struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
+              struct nvdimm_bus_descriptor *nfit_desc);
+
+::
+
+       /sys/devices/platform/nfit_test.0/ndbus0
+       |-- commands
+       |-- nd
+       |-- nfit
+       |-- nmem0
+       |-- nmem1
+       |-- nmem2
+       |-- nmem3
+       |-- power
+       |-- provider
+       |-- region0
+       |-- region1
+       |-- region2
+       |-- region3
+       |-- region4
+       |-- region5
+       |-- uevent
+       `-- wait_probe
+
+LIBNDCTL: bus enumeration example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Find the bus handle that describes the bus from Example NVDIMM Platform::
+
+       static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
+                       const char *provider)
+       {
+               struct ndctl_bus *bus;
+
+               ndctl_bus_foreach(ctx, bus)
+                       if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
+                               return bus;
+
+               return NULL;
+       }
+
+       bus = get_bus_by_provider(ctx, "nfit_test.0");
+
+
+LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
+-------------------------------
+
+The DIMM device provides a character device for sending commands to
+hardware, and it is a container for LABELs.  If the DIMM is defined by
+NFIT then an optional 'nfit' attribute sub-directory is available to add
+NFIT-specifics.
+
+Note that the kernel device name for "DIMMs" is "nmemX".  The NFIT
+describes these devices via "Memory Device to System Physical Address
+Range Mapping Structure", and there is no requirement that they actually
+be physical DIMMs, so we use a more generic name.
+
+LIBNVDIMM: DIMM (NMEM)
+^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+       struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
+                       const struct attribute_group **groups, unsigned long flags,
+                       unsigned long *dsm_mask);
+
+::
+
+       /sys/devices/platform/nfit_test.0/ndbus0
+       |-- nmem0
+       |   |-- available_slots
+       |   |-- commands
+       |   |-- dev
+       |   |-- devtype
+       |   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
+       |   |-- modalias
+       |   |-- nfit
+       |   |   |-- device
+       |   |   |-- format
+       |   |   |-- handle
+       |   |   |-- phys_id
+       |   |   |-- rev_id
+       |   |   |-- serial
+       |   |   `-- vendor
+       |   |-- state
+       |   |-- subsystem -> ../../../../../bus/nd
+       |   `-- uevent
+       |-- nmem1
+       [..]
+
+
+LIBNDCTL: DIMM enumeration example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Note, in this example we are assuming NFIT-defined DIMMs which are
+identified by an "nfit_handle", a 32-bit value where:
+
+   - Bit 3:0 DIMM number within the memory channel
+   - Bit 7:4 memory channel number
+   - Bit 11:8 memory controller ID
+   - Bit 15:12 socket ID (within scope of a Node controller if node
+     controller is present)
+   - Bit 27:16 Node Controller ID
+   - Bit 31:28 Reserved
+
+::
+
+       static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
+              unsigned int handle)
+       {
+               struct ndctl_dimm *dimm;
+
+               ndctl_dimm_foreach(bus, dimm)
+                       if (ndctl_dimm_get_handle(dimm) == handle)
+                               return dimm;
+
+               return NULL;
+       }
+
+       /* parenthesize each argument so expressions expand safely */
+       #define DIMM_HANDLE(n, s, i, c, d) \
+               ((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
+                | (((c) & 0xf) << 4) | ((d) & 0xf))
+
+       dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
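+
+As a complement to DIMM_HANDLE() above, the following is a minimal sketch
+of going the other way, i.e. splitting an "nfit_handle" into the fields
+listed earlier.  The struct and function names are illustrative
+assumptions and are not part of libndctl.
+

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields; not a libndctl type. */
struct nfit_handle_fields {
	unsigned int node;       /* bits 27:16, Node Controller ID */
	unsigned int socket;     /* bits 15:12, socket ID */
	unsigned int controller; /* bits 11:8, memory controller ID */
	unsigned int channel;    /* bits 7:4, memory channel number */
	unsigned int dimm;       /* bits 3:0, DIMM number within the channel */
};

/* Split a 32-bit handle according to the bit layout listed above. */
static struct nfit_handle_fields decode_nfit_handle(uint32_t handle)
{
	struct nfit_handle_fields f;

	f.node = (handle >> 16) & 0xfff;
	f.socket = (handle >> 12) & 0xf;
	f.controller = (handle >> 8) & 0xf;
	f.channel = (handle >> 4) & 0xf;
	f.dimm = handle & 0xf;
	return f;
}
```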
+
+LIBNVDIMM/LIBNDCTL: Region
+--------------------------
+
+A generic REGION device is registered for each PMEM range or BLK-aperture
+set.  Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
+sets on the "nfit_test.0" bus.  The primary role of regions is to be a
+container of "mappings".  A mapping is a tuple of <DIMM,
+DPA-start-offset, length>.
+
+LIBNVDIMM provides a built-in driver for these REGION devices.  This driver
+is responsible for reconciling the aliased DPA mappings across all
+regions, parsing the LABEL, if present, and then emitting NAMESPACE
+devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
+nd_blk device driver to consume.
+
+In addition to the generic attributes of "mappings", "interleave_ways",
+and "size", the REGION device also exports some convenience attributes.
+"nstype" indicates the integer type of namespace-device this region
+emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
+'add' event, "modalias" duplicates the MODALIAS variable stored by udev
+at the 'add' event, and finally, the optional "spa_index" is provided in
+the case where the region is defined by a SPA.
+
+LIBNVDIMM: region::
+
+       struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
+                       struct nd_region_desc *ndr_desc);
+       struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
+                       struct nd_region_desc *ndr_desc);
+
+::
+
+       /sys/devices/platform/nfit_test.0/ndbus0
+       |-- region0
+       |   |-- available_size
+       |   |-- btt0
+       |   |-- btt_seed
+       |   |-- devtype
+       |   |-- driver -> ../../../../../bus/nd/drivers/nd_region
+       |   |-- init_namespaces
+       |   |-- mapping0
+       |   |-- mapping1
+       |   |-- mappings
+       |   |-- modalias
+       |   |-- namespace0.0
+       |   |-- namespace_seed
+       |   |-- numa_node
+       |   |-- nfit
+       |   |   `-- spa_index
+       |   |-- nstype
+       |   |-- set_cookie
+       |   |-- size
+       |   |-- subsystem -> ../../../../../bus/nd
+       |   `-- uevent
+       |-- region1
+       [..]
+
+LIBNDCTL: region enumeration example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sample region retrieval routines based on NFIT-unique data like
+"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
+BLK::
+
+       static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
+                       unsigned int spa_index)
+       {
+               struct ndctl_region *region;
+
+               ndctl_region_foreach(bus, region) {
+                       if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
+                               continue;
+                       if (ndctl_region_get_spa_index(region) == spa_index)
+                               return region;
+               }
+               return NULL;
+       }
+
+       static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
+                       unsigned int handle)
+       {
+               struct ndctl_region *region;
+
+               ndctl_region_foreach(bus, region) {
+                       struct ndctl_mapping *map;
+
+                       if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
+                               continue;
+                       ndctl_mapping_foreach(region, map) {
+                               struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
+
+                               if (ndctl_dimm_get_handle(dimm) == handle)
+                                       return region;
+                       }
+               }
+               return NULL;
+       }
+
+
+Why Not Encode the Region Type into the Region Name?
+----------------------------------------------------
+
+At first glance it seems that, since NFIT defines just the PMEM and BLK
+interface types, we should simply name REGION devices with something derived
+from those type names.  However, the ND subsystem explicitly keeps the
+REGION name generic and expects userspace to always consider the
+region-attributes for four reasons:
+
+    1. There are already more than two REGION and "namespace" types.  For
+       PMEM there are two subtypes.  As mentioned previously we have PMEM where
+       the constituent DIMM devices are known and anonymous PMEM.  For BLK
+       regions the NFIT specification already anticipates vendor specific
+       implementations.  The exact distinction of what a region contains is in
+       the region-attributes not the region-name or the region-devtype.
+
+    2. A region with zero child-namespaces is a possible configuration.  For
+       example, the NFIT allows for a DCR to be published without a
+       corresponding BLK-aperture.  This equates to a DIMM that can only accept
+       control/configuration messages, but no i/o through a descendant block
+       device.  Again, this "type" is advertised in the attributes ('mappings'
+       == 0) and the name does not tell you much.
+
+    3. What if a third major interface type arises in the future?  Outside
+       of vendor specific implementations, it's not difficult to envision a
+       third class of interface type beyond BLK and PMEM.  With a generic name
+       for the REGION level of the device-hierarchy old userspace
+       implementations can still make sense of new kernel advertised
+       region-types.  Userspace can always rely on the generic region
+       attributes like "mappings", "size", etc and the expected child devices
+       named "namespace".  This generic format of the device-model hierarchy
+       allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
+       future-proof.
+
+    4. There are more robust mechanisms for determining the major type of a
+       region than a device name.  See the next section, How Do I Determine the
+       Major Type of a Region?
+
+How Do I Determine the Major Type of a Region?
+----------------------------------------------
+
+Outside of the blanket recommendation of "use libndctl", or simply
+looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
+"nstype" integer attribute, here are some other options.
+
+1. module alias lookup
+^^^^^^^^^^^^^^^^^^^^^^
+
+    The whole point of region/namespace device type differentiation is to
+    decide which block-device driver will attach to a given LIBNVDIMM namespace.
+    One can simply use the modalias to look up the resulting module.  It's
+    important to note that this method is robust in the presence of a
+    vendor-specific driver down the road.  If a vendor-specific
+    implementation wants to supplant the standard nd_blk driver it can with
+    minimal impact to the rest of LIBNVDIMM.
+
+    In fact, a vendor may also want to have a vendor-specific region-driver
+    (outside of nd_region).  For example, if a vendor defined its own LABEL
+    format it would need its own region driver to parse that LABEL and emit
+    the resulting namespaces.  The output from module resolution is more
+    accurate than a region-name or region-devtype.
+
+2. udev
+^^^^^^^
+
+    The kernel "devtype" is registered in the udev database::
+
+       # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
+       P: /devices/platform/nfit_test.0/ndbus0/region0
+       E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
+       E: DEVTYPE=nd_pmem
+       E: MODALIAS=nd:t2
+       E: SUBSYSTEM=nd
+
+       # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
+       P: /devices/platform/nfit_test.0/ndbus0/region4
+       E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
+       E: DEVTYPE=nd_blk
+       E: MODALIAS=nd:t3
+       E: SUBSYSTEM=nd
+
+    ...and is available as a region attribute, but keep in mind that the
+    "devtype" does not indicate sub-type variations and scripts should
+    really examine the other attributes as well.
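+
+    The MODALIAS values in the udev output above take the form "nd:tN".
+    As a minimal sketch, assuming that form holds, the numeric type can be
+    extracted as follows.  The function name is hypothetical; in the
+    output above a value of 2 goes with DEVTYPE=nd_pmem and 3 with
+    DEVTYPE=nd_blk.
+

```c
#include <stdio.h>

/*
 * Return the numeric type encoded in an "nd:tN" modalias string,
 * or -1 if the string does not have that form.
 */
static int nd_modalias_type(const char *modalias)
{
	int type;

	if (!modalias || sscanf(modalias, "nd:t%d", &type) != 1)
		return -1;
	return type;
}
```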
+
+3. type specific attributes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    As it currently stands a BLK-aperture region will never have a
+    "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region.  A
+    BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
+    that does not allow I/O.  A PMEM region with a "mappings" value of zero
+    is a simple system-physical-address range.
+
+
+LIBNVDIMM/LIBNDCTL: Namespace
+-----------------------------
+
+A REGION, after resolving DPA aliasing and LABEL specified boundaries,
+surfaces one or more "namespace" devices.  The arrival of a "namespace"
+device currently triggers either the nd_blk or nd_pmem driver to load
+and register a disk/block device.
+
+LIBNVDIMM: namespace
+^^^^^^^^^^^^^^^^^^^^
+
+Here is a sample layout from the three major types of NAMESPACE where
+namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
+attribute), namespace2.0 represents a BLK namespace (note that it has a
+'sector_size' attribute), and namespace6.0 represents an anonymous
+PMEM namespace (note that it has no 'uuid' attribute because it does not
+support a LABEL)::
+
+       /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
+       |-- alt_name
+       |-- devtype
+       |-- dpa_extents
+       |-- force_raw
+       |-- modalias
+       |-- numa_node
+       |-- resource
+       |-- size
+       |-- subsystem -> ../../../../../../bus/nd
+       |-- type
+       |-- uevent
+       `-- uuid
+       /sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
+       |-- alt_name
+       |-- devtype
+       |-- dpa_extents
+       |-- force_raw
+       |-- modalias
+       |-- numa_node
+       |-- sector_size
+       |-- size
+       |-- subsystem -> ../../../../../../bus/nd
+       |-- type
+       |-- uevent
+       `-- uuid
+       /sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
+       |-- block
+       |   `-- pmem0
+       |-- devtype
+       |-- driver -> ../../../../../../bus/nd/drivers/pmem
+       |-- force_raw
+       |-- modalias
+       |-- numa_node
+       |-- resource
+       |-- size
+       |-- subsystem -> ../../../../../../bus/nd
+       |-- type
+       `-- uevent
+
+LIBNDCTL: namespace enumeration example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Namespaces are indexed relative to their parent region, as in the example
+below.  These indexes are mostly static from boot to boot, but the subsystem
+makes no guarantees in this regard.  For a static namespace identifier use
+its 'uuid' attribute.
+
+::
+
+  static struct ndctl_namespace
+  *get_namespace_by_id(struct ndctl_region *region, unsigned int id)
+  {
+          struct ndctl_namespace *ndns;
+
+          ndctl_namespace_foreach(region, ndns)
+                  if (ndctl_namespace_get_id(ndns) == id)
+                          return ndns;
+
+          return NULL;
+  }
+
+LIBNDCTL: namespace creation example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Idle namespaces are automatically created by the kernel if a given
+region has enough available capacity to create a new namespace.
+Namespace instantiation involves finding an idle namespace and
+configuring it.  For the most part the setting of namespace attributes
+can occur in any order, the only constraint is that 'uuid' must be set
+before 'size'.  This enables the kernel to track DPA allocations
+internally with a static identifier::
+
+  static int configure_namespace(struct ndctl_region *region,
+                  struct ndctl_namespace *ndns,
+                  struct namespace_parameters *parameters)
+  {
+          char devname[50];
+
+          snprintf(devname, sizeof(devname), "namespace%d.%d",
+                          ndctl_region_get_id(region), parameters->id);
+
+          ndctl_namespace_set_alt_name(ndns, devname);
+          /* 'uuid' must be set prior to setting size! */
+          ndctl_namespace_set_uuid(ndns, parameters->uuid);
+          ndctl_namespace_set_size(ndns, parameters->size);
+          /* unlike pmem namespaces, blk namespaces have a sector size */
+          if (parameters->lbasize)
+                  ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
+          /* propagate the enable result to the caller */
+          return ndctl_namespace_enable(ndns);
+  }
+
+
+Why the Term "namespace"?
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    1. Why not "volume" for instance?  "volume" ran the risk of confusing
+       ND (libnvdimm subsystem) with a volume manager like device-mapper.
+
+    2. The term originated to describe the sub-devices that can be created
+       within an NVME controller (see the nvme specification:
+       http://www.nvmexpress.org/specifications/), and NFIT namespaces are
+       meant to parallel the capabilities and configurability of
+       NVME-namespaces.
+
+
+LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
+-------------------------------------------------
+
+A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
+block device driver that fronts either the whole block device or a
+partition of a block device emitted by either a PMEM or BLK NAMESPACE.
+
+LIBNVDIMM: btt layout
+^^^^^^^^^^^^^^^^^^^^^
+
+Every region will start out with at least one BTT device which is the
+seed device.  To activate it set the "namespace", "uuid", and
+"sector_size" attributes and then bind the device to the nd_pmem or
+nd_blk driver depending on the region type::
+
+       /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
+       |-- namespace
+       |-- delete
+       |-- devtype
+       |-- modalias
+       |-- numa_node
+       |-- sector_size
+       |-- subsystem -> ../../../../../bus/nd
+       |-- uevent
+       `-- uuid
+
+LIBNDCTL: btt creation example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Similar to namespaces, an idle BTT device is automatically created per
+region.  Each time this "seed" btt device is configured and enabled a new
+seed is created.  Creating a BTT configuration involves two steps:
+finding an idle BTT and assigning it to consume a PMEM or BLK namespace::
+
+       static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
+       {
+               struct ndctl_btt *btt;
+
+               ndctl_btt_foreach(region, btt)
+                       if (!ndctl_btt_is_enabled(btt)
+                                       && !ndctl_btt_is_configured(btt))
+                               return btt;
+
+               return NULL;
+       }
+
+       static int configure_btt(struct ndctl_region *region,
+                       struct btt_parameters *parameters)
+       {
+               struct ndctl_btt *btt = get_idle_btt(region);
+
+               /* no idle (seed) btt device available in this region */
+               if (!btt)
+                       return -1;
+
+               ndctl_btt_set_uuid(btt, parameters->uuid);
+               ndctl_btt_set_sector_size(btt, parameters->sector_size);
+               ndctl_btt_set_namespace(btt, parameters->ndns);
+               /* turn off raw mode device */
+               ndctl_namespace_disable(parameters->ndns);
+               /* turn on btt access */
+               return ndctl_btt_enable(btt);
+       }
+
+Once instantiated, a new inactive btt seed device will appear underneath
+the region.
+
+Once a "namespace" is removed from a BTT that instance of the BTT device
+will be deleted or otherwise reset to default values.  This deletion is
+only at the device model level.  In order to destroy a BTT the "info
+block" needs to be destroyed.  Note that to destroy a BTT the media
+needs to be written in raw mode.  By default, the kernel will autodetect
+the presence of a BTT and disable raw mode.  This autodetect behavior
+can be suppressed by enabling raw mode for the namespace via the
+ndctl_namespace_set_raw_mode() API.
+
+
+Summary LIBNDCTL Diagram
+------------------------
+
+For the given example above, here is the view of the objects as seen by the
+LIBNDCTL API::
+
+              +---+
+              |CTX|    +---------+   +--------------+  +---------------+
+              +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
+                |    | +---------+   +--------------+  +---------------+
+  +-------+     |    | +---------+   +--------------+  +---------------+
+  | DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
+  +-------+ |   |    | +---------+   +--------------+  +---------------+
+  | DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
+  +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" |
+  | DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+
+  +-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 |
+  | DIMM3 <-+        |               +--------------+  +----------------------+
+  +-------+          | +---------+   +--------------+  +---------------+
+                     +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" |
+                     | +---------+ | +--------------+  +----------------------+
+                     |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 |
+                     |               +--------------+  +----------------------+
+                     | +---------+   +--------------+  +---------------+
+                     +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" |
+                     | +---------+   +--------------+  +---------------+
+                     | +---------+   +--------------+  +----------------------+
+                     +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 |
+                       +---------+   +--------------+  +---------------+------+
diff --git a/Documentation/driver-api/nvdimm/security.rst b/Documentation/driver-api/nvdimm/security.rst

new file mode 100644 (file)

index 0000000..ad9dea0
--- /dev/null
+++ b/Documentation/driver-api/nvdimm/security.rst
@@ -0,0 +1,143 @@
+===============
+NVDIMM Security
+===============
+
+1. Introduction
+---------------
+
+With the introduction of Intel Device Specific Methods (DSM) v1.8
+specification [1], security DSMs are introduced. The spec added the following
+security DSMs: "get security state", "set passphrase", "disable passphrase",
+"unlock unit", "freeze lock", "secure erase", and "overwrite". A security_ops
+data structure has been added to struct dimm in order to support the security
+operations and generic APIs are exposed to allow vendor neutral operations.
+
+2. Sysfs Interface
+------------------
+The "security" sysfs attribute is provided in the nvdimm sysfs directory. For
+example:
+/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
+
+The "show" method of that attribute will display the security state for
+that DIMM. The following states are available: disabled, unlocked, locked,
+frozen, and overwrite. If security is not supported, the sysfs attribute
+will not be visible.
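+
+As a minimal sketch, assuming the attribute text is one of the state names
+listed above, optionally followed by a newline, userspace could map it to
+an enum as shown below.  The enum and function names are hypothetical and
+do not correspond to any kernel or libndctl API.
+

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace-side enum mirroring the states listed above. */
enum sec_state {
	SEC_DISABLED,
	SEC_UNLOCKED,
	SEC_LOCKED,
	SEC_FROZEN,
	SEC_OVERWRITE,
	SEC_UNKNOWN,
};

/* Map the text read from the "security" attribute to an enum value. */
static enum sec_state parse_security_state(const char *text)
{
	static const char *const names[] = {
		"disabled", "unlocked", "locked", "frozen", "overwrite",
	};
	size_t len = strcspn(text, "\n"); /* ignore a trailing newline */
	size_t i;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		if (strlen(names[i]) == len &&
		    strncmp(text, names[i], len) == 0)
			return (enum sec_state)i;
	return SEC_UNKNOWN;
}
```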
+
+The "store" attribute takes several commands when it is being written to
+in order to support some of the security functionalities:
+
+- update <old_keyid> <new_keyid> - enable or update passphrase.
+- disable <keyid> - disable enabled security and remove key.
+- freeze - freeze changing of security states.
+- erase <keyid> - delete existing user encryption key.
+- overwrite <keyid> - wipe the entire nvdimm.
+- master_update <keyid> <new_keyid> - enable or update master passphrase.
+- master_erase <keyid> - delete existing user encryption key.
+
+3. Key Management
+-----------------
+
+The key is associated to the payload by the DIMM id. For example:
+# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/nfit/id
+8089-a2-1740-00000133
+The DIMM id would be provided along with the key payload (passphrase) to
+the kernel.
+
+The security keys are managed on the basis of a single key per DIMM. The
+key "passphrase" is expected to be 32 bytes long. This is similar to the ATA
+security specification [2]. A key is initially acquired via the request_key()
+kernel API call during nvdimm unlock. It is up to the user to make sure that
+all the keys are in the kernel user keyring for unlock.
+
+An nvdimm encrypted-key of format enc32 has the description format of:
+nvdimm:<bus-provider-specific-unique-id>
+
+See file ``Documentation/security/keys/trusted-encrypted.rst`` for creating
+encrypted-keys of enc32 format. TPM usage with a master trusted key is
+preferred for sealing the encrypted-keys.
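+
+A minimal sketch of building such a key description follows.  It assumes
+that the unique id is the DIMM id string shown in the example above; the
+function name is hypothetical.
+

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write "nvdimm:<unique_id>" into buf.
 * Returns 0 on success, -1 if the result does not fit in len bytes.
 */
static int nvdimm_key_description(char *buf, size_t len,
				  const char *unique_id)
{
	int n = snprintf(buf, len, "nvdimm:%s", unique_id);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```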
+
+4. Unlocking
+------------
+When the DIMMs are being enumerated by the kernel, the kernel will attempt to
+retrieve the key from the kernel user keyring. This is the only time
+a locked DIMM can be unlocked. Once unlocked, the DIMM will remain unlocked
+until reboot. Typically an entity (e.g. a shell script) will inject all the
+relevant encrypted-keys into the kernel user keyring during the initramfs phase.
+This provides the unlock function access to all the related keys that contain
+the passphrase for the respective nvdimms.  It is also recommended that the
+keys are injected before libnvdimm is loaded by modprobe.
+
+5. Update
+---------
+When doing an update, it is expected that the existing key is removed from
+the kernel user keyring and reinjected as a different (old) key. It's irrelevant
+what the key description is for the old key since we are only interested in the
+keyid when doing the update operation. It is also expected that the new key
+is injected with the description format described earlier in this
+document.  The update command written to the sysfs attribute has
+the format:
+update <old keyid> <new keyid>
+
+If there is no old keyid due to a security enabling, then a 0 should be
+passed in.
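+
+The command string described above can be sketched as follows.  This only
+formats the text to be written to the "security" attribute; the function
+name is hypothetical, and the keyids are assumed to be the integer key
+serial numbers reported by the kernel keyring.
+

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format "update <old keyid> <new keyid>" into buf.  Pass 0 as
 * old_keyid when security is being enabled for the first time.
 * Returns 0 on success, -1 if the result does not fit in len bytes.
 */
static int format_update_command(char *buf, size_t len,
				 int32_t old_keyid, int32_t new_keyid)
{
	int n = snprintf(buf, len, "update %d %d",
			 (int)old_keyid, (int)new_keyid);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```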
+
+6. Freeze
+---------
+The freeze operation does not require any keys. The security config can be
+frozen by a user with root privilege.
+
+7. Disable
+----------
+The security disable command format is:
+disable <keyid>
+
+A key with the current passphrase payload that is tied to the nvdimm should be
+in the kernel user keyring.
+
+8. Secure Erase
+---------------
+The command format for doing a secure erase is:
+erase <keyid>
+
+A key with the current passphrase payload that is tied to the nvdimm should be
+in the kernel user keyring.
+
+9. Overwrite
+------------
+The command format for doing an overwrite is:
+overwrite <keyid>
+
+Overwrite can be done without a key if security is not enabled. A key serial
+of 0 can be passed in to indicate no key.
+
+The sysfs attribute "security" can be polled to wait on overwrite completion.
+Overwrite can last tens of minutes or more depending on nvdimm size.
+
+An encrypted-key with the current user passphrase that is tied to the nvdimm
+should be injected and its keyid should be passed in via sysfs.
+
+10. Master Update
+-----------------
+The command format for doing a master update is:
+master_update <old keyid> <new keyid>
+
+The operating mechanism for master update is identical to update except the
+master passphrase key is passed to the kernel. The master passphrase key
+is just another encrypted-key.
+
+This command is only available when security is disabled.
+
+11. Master Erase
+----------------
+The command format for doing a master erase is:
+master_erase <current keyid>
+
+This command has the same operating mechanism as erase except the master
+passphrase key is passed to the kernel. The master passphrase key is just
+another encrypted-key.
+
+This command is only available when the master security is enabled, indicated
+by the extended security status.
+
+[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
+
+[2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf
diff --git a/Documentation/nvdimm/btt.rst b/Documentation/nvdimm/btt.rst

deleted file mode 100644 (file)

index 2d8269f..0000000
--- a/Documentation/nvdimm/btt.rst
+++ /dev/null
@@ -1,285 +0,0 @@
-=============================
-BTT - Block Translation Table
-=============================
-
-
-1. Introduction
-===============
-
-Persistent memory based storage is able to perform IO at byte (or more
-accurately, cache line) granularity. However, we often want to expose such
-storage as traditional block devices. The block drivers for persistent memory
-will do exactly this. However, they do not provide any atomicity guarantees.
-Traditional SSDs typically provide protection against torn sectors in hardware,
-using stored energy in capacitors to complete in-flight block writes, or perhaps
-in firmware. We don't have this luxury with persistent memory - if a write is in
-progress, and we experience a power failure, the block will contain a mix of old
-and new data. Applications may not be prepared to handle such a scenario.
-
-The Block Translation Table (BTT) provides atomic sector update semantics for
-persistent memory devices, so that applications that rely on sector writes not
-being torn can continue to do so. The BTT manifests itself as a stacked block
-device, and reserves a portion of the underlying storage for its metadata. At
-the heart of it, is an indirection table that re-maps all the blocks on the
-volume. It can be thought of as an extremely simple file system that only
-provides atomic sector updates.
-
-
-2. Static Layout
-================
-
-The underlying storage on which a BTT can be laid out is not limited in any way.
-The BTT, however, splits the available space into chunks of up to 512 GiB,
-called "Arenas".
-
-Each arena follows the same layout for its metadata, and all references in an
-arena are internal to it (with the exception of one field that points to the
-next arena). The following depicts the "On-disk" metadata layout::
-
-
-    Backing Store     +------->  Arena
-  +---------------+   |   +------------------+
-  |               |   |   | Arena info block |
-  |    Arena 0    +---+   |       4K         |
-  |     512G      |       +------------------+
-  |               |       |                  |
-  +---------------+       |                  |
-  |               |       |                  |
-  |    Arena 1    |       |   Data Blocks    |
-  |     512G      |       |                  |
-  |               |       |                  |
-  +---------------+       |                  |
-  |       .       |       |                  |
-  |       .       |       |                  |
-  |       .       |       |                  |
-  |               |       |                  |
-  |               |       |                  |
-  +---------------+       +------------------+
-                          |                  |
-                          |     BTT Map      |
-                          |                  |
-                          |                  |
-                          +------------------+
-                          |                  |
-                          |     BTT Flog     |
-                          |                  |
-                          +------------------+
-                          | Info block copy  |
-                          |       4K         |
-                          +------------------+
-
-
-3. Theory of Operation
-======================
-
-
-a. The BTT Map
---------------
-
-The map is a simple lookup/indirection table that maps an LBA to an internal
-block. Each map entry is 32 bits. The two most significant bits are special
-flags, and the remaining form the internal block number.
-
-======== =============================================================
-Bit      Description
-======== =============================================================
-31 - 30         Error and Zero flags - Used in the following way:
-
-          == ==  ====================================================
-          31 30  Description
-          == ==  ====================================================
-          0  0   Initial state. Reads return zeroes; Premap = Postmap
-          0  1   Zero state: Reads return zeroes
-          1  0   Error state: Reads fail; Writes clear 'E' bit
-          1  1   Normal Block – has valid postmap
-          == ==  ====================================================
-
-29 - 0  Mappings to internal 'postmap' blocks
-======== =============================================================
-
-
-Some of the terminology that will be subsequently used:
-
-============   ================================================================
-External LBA   LBA as made visible to upper layers.
-ABA            Arena Block Address - Block offset/number within an arena
-Premap ABA     The block offset into an arena, which was decided upon by range
-               checking the External LBA
-Postmap ABA    The block number in the "Data Blocks" area obtained after
-               indirection from the map
-nfree          The number of free blocks that are maintained at any given time.
-               This is the number of concurrent writes that can happen to the
-               arena.
-============   ================================================================
-
-
-For example, after adding a BTT, we surface a disk of 1024G. We get a read for
-the external LBA at 768G. This falls into the second arena, and of the 512G
-worth of blocks that this arena contributes, this block is at 256G. Thus, the
-premap ABA is 256G. We now refer to the map, and find out the mapping for block
-'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
-
-
-b. The BTT Flog
----------------
-
-The BTT provides sector atomicity by making every write an "allocating write",
-i.e. Every write goes to a "free" block. A running list of free blocks is
-maintained in the form of the BTT flog. 'Flog' is a combination of the words
-"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
-
-========  =====================================================================
-lba       The premap ABA that is being written to
-old_map   The old postmap ABA - after 'this' write completes, this will be a
-         free block.
-new_map   The new postmap ABA. The map will up updated to reflect this
-         lba->postmap_aba mapping, but we log it here in case we have to
-         recover.
-seq      Sequence number to mark which of the 2 sections of this flog entry is
-         valid/newest. It cycles between 01->10->11->01 (binary) under normal
-         operation, with 00 indicating an uninitialized state.
-lba'     alternate lba entry
-old_map'  alternate old postmap entry
-new_map'  alternate new postmap entry
-seq'     alternate sequence number.
-========  =====================================================================
-
-Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
-padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
-done such that for any entry being written, it:
-a. overwrites the 'old' section in the entry based on sequence numbers
-b. writes the 'new' section such that the sequence number is written last.
-
-
-c. The concept of lanes
------------------------
-
-While 'nfree' describes the number of concurrent IOs an arena can process
-concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
-process::
-
-       nlanes = min(nfree, num_cpus)
-
-A lane number is obtained at the start of any IO, and is used for indexing into
-all the on-disk and in-memory data structures for the duration of the IO. If
-there are more CPUs than the max number of available lanes, than lanes are
-protected by spinlocks.
-
-
-d. In-memory data structure: Read Tracking Table (RTT)
-------------------------------------------------------
-
-Consider a case where we have two threads, one doing reads and the other,
-writes. We can hit a condition where the writer thread grabs a free block to do
-a new IO, but the (slow) reader thread is still reading from it. In other words,
-the reader consulted a map entry, and started reading the corresponding block. A
-writer started writing to the same external LBA, and finished the write updating
-the map for that external LBA to point to its new postmap ABA. At this point the
-internal, postmap block that the reader is (still) reading has been inserted
-into the list of free blocks. If another write comes in for the same LBA, it can
-grab this free block, and start writing to it, causing the reader to read
-incorrect data. To prevent this, we introduce the RTT.
-
-The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
-into rtt[lane_number], the postmap ABA it is reading, and clears it after the
-read is complete. Every writer thread, after grabbing a free block, checks the
-RTT for its presence. If the postmap free block is in the RTT, it waits till the
-reader clears the RTT entry, and only then starts writing to it.
-
-
-e. In-memory data structure: map locks
---------------------------------------
-
-Consider a case where two writer threads are writing to the same LBA. There can
-be a race in the following sequence of steps::
-
-       free[lane] = map[premap_aba]
-       map[premap_aba] = postmap_aba
-
-Both threads can update their respective free[lane] with the same old, freed
-postmap_aba. This has made the layout inconsistent by losing a free entry, and
-at the same time, duplicating another free entry for two lanes.
-
-To solve this, we could have a single map lock (per arena) that has to be taken
-before performing the above sequence, but we feel that could be too contentious.
-Instead we use an array of (nfree) map_locks that is indexed by
-(premap_aba modulo nfree).
-
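As a small illustration of the map_lock indexing, in C (the function name `map_lock_index` is hypothetical):

```c
#include <stdint.h>

/*
 * Index of the map lock that guards a given premap ABA.
 * nfree must be non-zero.
 */
uint32_t map_lock_index(uint32_t premap_aba, uint32_t nfree)
{
	return premap_aba % nfree;
}
```

Two writers to the same premap ABA always compute the same index and therefore serialize on one lock. Writers to different ABAs only contend when their ABAs are congruent modulo nfree.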
-
-f. Reconstruction from the Flog
--------------------------------
-
-On startup, we analyze the BTT flog to create our list of free blocks. We walk
-through all the entries, and for each lane, of the set of two possible
-'sections', we always look at the most recent one only (based on the sequence
-number). The reconstruction rules/steps are simple:
-
-- Read map[log_entry.lba].
-- If log_entry.new matches the map entry, then log_entry.old is free.
-- If log_entry.new does not match the map entry, then log_entry.new is free.
-  (This case can only be caused by power-fails/unsafe shutdowns)
-
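The reconstruction rules above can be sketched as a single C function. The struct layout and names are hypothetical and simplified: the map is modelled as a plain in-memory array of postmap ABAs, and the caller is assumed to have already selected the most recent section of the flog entry.

```c
#include <stdint.h>

/* One section of a flog entry (hypothetical in-memory form). */
struct flog_section {
	uint32_t lba;		/* premap ABA that was being written */
	uint32_t old_map;	/* old postmap ABA */
	uint32_t new_map;	/* new postmap ABA */
	uint32_t seq;		/* sequence number */
};

/*
 * Return the postmap ABA that is free for this lane, given the newest
 * flog section and the map (indexed by premap ABA).
 */
uint32_t flog_free_block(const struct flog_section *newest,
			 const uint32_t *map)
{
	/* The map update completed, so the old block was released. */
	if (map[newest->lba] == newest->new_map)
		return newest->old_map;

	/* The map update never landed, so the new block is still free. */
	return newest->new_map;
}
```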
-
-g. Summarizing - Read and Write flows
--------------------------------------
-
-Read:
-
-1.  Convert external LBA to arena number + pre-map ABA
-2.  Get a lane (and take lane_lock)
-3.  Read map to get the entry for this pre-map ABA
-4.  Enter post-map ABA into RTT[lane]
-5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
-6.  If ERROR flag set in map, end IO with EIO (go to step 8)
-7.  Read data from this block
-8.  Remove post-map ABA entry from RTT[lane]
-9.  Release lane (and lane_lock)
-
-Write:
-
-1.  Convert external LBA to Arena number + pre-map ABA
-2.  Get a lane (and take lane_lock)
-3.  Use lane to index into in-memory free list and obtain a new block, next flog
-    index, next sequence number
-4.  Scan the RTT to check if free block is present, and spin/wait if it is.
-5.  Write data to this free block
-6.  Read map to get the existing post-map ABA entry for this pre-map ABA
-7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
-8.  Write new post-map ABA into map.
-9.  Write old post-map entry into the free list
-10. Calculate next sequence number and write into the free list entry
-11. Release lane (and lane_lock)
-
-
-4. Error Handling
-=================
-
-An arena would be in an error state if any of the metadata is corrupted
-irrecoverably, either due to a bug or a media error. The following conditions
-indicate an error:
-
-- Info block checksum does not match (and recovering from the copy also fails)
-- Not all internal available blocks are uniquely and entirely addressed by the
-  sum of mapped blocks and free blocks (from the BTT flog).
-- Rebuilding free list from the flog reveals missing/duplicate/impossible
-  entries
-- A map entry is out of bounds
-
-If any of these error conditions is encountered, the arena is put into a
-read-only state using a flag in the info block.
-
-
-5. Usage
-========
-
-The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
-(pmem, or blk mode). The easiest way to set up such a namespace is using the
-'ndctl' utility [1].
-
-For example, the ndctl command line to set up a BTT with a 4k sector size is::
-
-    ndctl create-namespace -f -e namespace0.0 -m sector -l 4k
-
-See ndctl create-namespace --help for more options.
-
-[1]: https://github.com/pmem/ndctl
diff --git a/Documentation/nvdimm/index.rst b/Documentation/nvdimm/index.rst

deleted file mode 100644 (file)

index 1a3402d..0000000
--- a/Documentation/nvdimm/index.rst
+++ /dev/null
@@ -1,12 +0,0 @@
-:orphan:
-
-===================================
-Non-Volatile Memory Device (NVDIMM)
-===================================
-
-.. toctree::
-   :maxdepth: 1
-
-   nvdimm
-   btt
-   security
diff --git a/Documentation/nvdimm/nvdimm.rst b/Documentation/nvdimm/nvdimm.rst

deleted file mode 100644 (file)

index 08f855c..0000000
--- a/Documentation/nvdimm/nvdimm.rst
+++ /dev/null
@@ -1,887 +0,0 @@
-===============================
-LIBNVDIMM: Non-Volatile Devices
-===============================
-
-libnvdimm - kernel / libndctl - userspace helper library
-
-linux-nvdimm@lists.01.org
-
-Version 13
-
-.. contents:
-
-       Glossary
-       Overview
-           Supporting Documents
-           Git Trees
-       LIBNVDIMM PMEM and BLK
-       Why BLK?
-           PMEM vs BLK
-               BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
-       Example NVDIMM Platform
-       LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
-           LIBNDCTL: Context
-               libndctl: instantiate a new library context example
-           LIBNVDIMM/LIBNDCTL: Bus
-               libnvdimm: control class device in /sys/class
-               libnvdimm: bus
-               libndctl: bus enumeration example
-           LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
-               libnvdimm: DIMM (NMEM)
-               libndctl: DIMM enumeration example
-           LIBNVDIMM/LIBNDCTL: Region
-               libnvdimm: region
-               libndctl: region enumeration example
-               Why Not Encode the Region Type into the Region Name?
-               How Do I Determine the Major Type of a Region?
-           LIBNVDIMM/LIBNDCTL: Namespace
-               libnvdimm: namespace
-               libndctl: namespace enumeration example
-               libndctl: namespace creation example
-               Why the Term "namespace"?
-           LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
-               libnvdimm: btt layout
-               libndctl: btt creation example
-       Summary LIBNDCTL Diagram
-
-
-Glossary
-========
-
-PMEM:
-  A system-physical-address range where writes are persistent.  A
-  block device composed of PMEM is capable of DAX.  A PMEM address range
-  may span an interleave of several DIMMs.
-
-BLK:
-  A set of one or more programmable memory mapped apertures provided
-  by a DIMM to access its media.  This indirection precludes the
-  performance benefit of interleaving, but enables DIMM-bounded failure
-  modes.
-
-DPA:
-  DIMM Physical Address, is a DIMM-relative offset.  With one DIMM in
-  the system there would be a 1:1 system-physical-address:DPA association.
-  Once more DIMMs are added a memory controller interleave must be
-  decoded to determine the DPA associated with a given
-  system-physical-address.  BLK capacity always has a 1:1 relationship
-  with a single-DIMM's DPA range.
-
-DAX:
-  File system extensions to bypass the page cache and block layer to
-  mmap persistent memory, from a PMEM block device, directly into a
-  process address space.
-
-DSM:
-  Device Specific Method: ACPI method to control a specific
-  device - in this case the firmware.
-
-DCR:
-  NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
-  It defines a vendor-id, device-id, and interface format for a given DIMM.
-
-BTT:
-  Block Translation Table: Persistent memory is byte addressable.
-  Existing software may have an expectation that the power-fail-atomicity
-  of writes is at least one sector, 512 bytes.  The BTT is an indirection
-  table with atomic update semantics to front a PMEM/BLK block device
-  driver and present arbitrary atomic sector sizes.
-
-LABEL:
-  Metadata stored on a DIMM device that partitions and identifies
-  (persistently names) storage between PMEM and BLK.  It also partitions
-  BLK storage to host BTTs with different parameters per BLK-partition.
-  Note that traditional partition tables, GPT/MBR, are layered on top of a
-  BLK or PMEM device.
-
-
-Overview
-========
-
-The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
-PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
-and BLK mode access.  These three modes of operation are described by
-the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6.  While the LIBNVDIMM
-implementation is generic and supports pre-NFIT platforms, it was guided
-by the superset of capabilities needed to support this ACPI 6 definition
-for NVDIMM resources.  The bulk of the kernel implementation is in place
-to handle the case where DPA accessible via PMEM is aliased with DPA
-accessible via BLK.  When that occurs a LABEL is needed to reserve DPA
-for exclusive access via one mode at a time.
-
-Supporting Documents
---------------------
-
-ACPI 6:
-       http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
-NVDIMM Namespace:
-       http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
-DSM Interface Example:
-       http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
-Driver Writer's Guide:
-       http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
-
-Git Trees
----------
-
-LIBNVDIMM:
-       https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
-LIBNDCTL:
-       https://github.com/pmem/ndctl.git
-PMEM:
-       https://github.com/01org/prd
-
-
-LIBNVDIMM PMEM and BLK
-======================
-
-Prior to the arrival of the NFIT, non-volatile memory was described to a
-system in various ad-hoc ways.  Usually only the bare minimum was
-provided, namely, a single system-physical-address range where writes
-are expected to be durable after a system power loss.  Now, the NFIT
-specification standardizes not only the description of PMEM, but also
-BLK and platform message-passing entry points for control and
-configuration.
-
-For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
-device driver:
-
-    1. PMEM (nd_pmem.ko): Drives a system-physical-address range.  This
-       range is contiguous in system memory and may be interleaved (hardware
-       memory controller striped) across multiple DIMMs.  When interleaved the
-       platform may optionally provide details of which DIMMs are participating
-       in the interleave.
-
-       Note that while LIBNVDIMM describes system-physical-address ranges that may
-       alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
-       alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
-       distinction.  The different device-types are an implementation detail
-       that userspace can exploit to implement policies like "only interface
-       with address ranges from certain DIMMs".  It is worth noting that when
-       aliasing is present and a DIMM lacks a label, then no block device can
-       be created by default as userspace needs to do at least one allocation
-       of DPA to the PMEM range.  In contrast ND_NAMESPACE_IO ranges, once
-       registered, can be immediately attached to nd_pmem.
-
-    2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
-       defined apertures.  A set of apertures will access just one DIMM.
-       Multiple windows (apertures) allow multiple concurrent accesses, much like
-       tagged-command-queuing, and would likely be used by different threads or
-       different CPUs.
-
-       The NFIT specification defines a standard format for a BLK-aperture, but
-       the spec also allows for vendor specific layouts, and non-NFIT BLK
-       implementations may have other designs for BLK I/O.  For this reason
-       "nd_blk" calls back into platform-specific code to perform the I/O.
-
-       One such implementation is defined in the "Driver Writer's Guide" and "DSM
-       Interface Example".
-
-
-Why BLK?
-========
-
-While PMEM provides direct byte-addressable CPU-load/store access to
-NVDIMM storage, it does not provide the best system RAS (recovery,
-availability, and serviceability) model.  An access to a corrupted
-system-physical-address causes a CPU exception while an access
-to a corrupted address through a BLK-aperture causes that block window
-to raise an error status in a register.  The latter is more aligned with
-the standard error model that host-bus-adapter attached disks present.
-
-Also, if an administrator ever wants to replace memory, it is easier to
-service a system at DIMM module boundaries.  Compare this to PMEM where
-data could be interleaved in an opaque hardware specific manner across
-several DIMMs.
-
-PMEM vs BLK
------------
-
-BLK-apertures solve these RAS problems, but their presence is also the
-major contributing factor to the complexity of the ND subsystem.  They
-complicate the implementation because PMEM and BLK alias in DPA space.
-Any given DIMM's DPA-range may contribute to one or more
-system-physical-address sets of interleaved DIMMs, *and* may also be
-accessed in its entirety through its BLK-aperture.  Accessing a DPA
-through a system-physical-address while simultaneously accessing the
-same DPA through a BLK-aperture has undefined results.  For this reason,
-DIMMs with this dual interface configuration include a DSM function to
-store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space
-into exclusive system-physical-address and BLK-aperture accessible
-regions.  For simplicity a DIMM is allowed a PMEM "region" per each
-interleave set in which it is a member.  The remaining DPA space can be
-carved into an arbitrary number of BLK devices with discontiguous
-extents.
-
-BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-One of the few
-reasons to allow multiple BLK namespaces per REGION is so that each
-BLK-namespace can be configured with a BTT with unique atomic sector
-sizes.  While a PMEM device can host a BTT the LABEL specification does
-not provide for a sector size to be specified for a PMEM namespace.
-
-This is due to the expectation that the primary usage model for PMEM is
-via DAX, and the BTT is incompatible with DAX.  However, for the cases
-where an application or filesystem still needs atomic sector update
-guarantees it can register a BTT on a PMEM device or partition.  See
-LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".
-
-
-Example NVDIMM Platform
-=======================
-
-For the remainder of this document the following diagram will be
-referenced for any example sysfs layouts::
-
-
-                               (a)               (b)           DIMM   BLK-REGION
-            +-------------------+--------+--------+--------+
-  +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
-  | imc0 +--+- - - region0- - - +--------+        +--------+
-  +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
-     |      +-------------------+--------v        v--------+
-  +--+---+                               |                 |
-  | cpu0 |                                     region1
-  +--+---+                               |                 |
-     |      +----------------------------^        ^--------+
-  +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
-  | imc1 +--+----------------------------|        +--------+
-  +------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
-            +----------------------------+--------+--------+
-
-In this platform we have four DIMMs and two memory controllers in one
-socket.  Each unique interface (BLK or PMEM) to DPA space is identified
-by a region device with a dynamically assigned id (REGION0 - REGION5).
-
-    1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
-       single PMEM namespace is created in the REGION0-SPA-range that spans most
-       of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
-       interleaved system-physical-address range is reclaimed as BLK-aperture
-       accessed space starting at DPA-offset (a) into each DIMM.  In that
-       reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
-       REGION3 where "blk2.0" and "blk3.0" are just human readable names that
-       could be set to any user-desired name in the LABEL.
-
-    2. In the last portion of DIMM0 and DIMM1 we have an interleaved
-       system-physical-address range, REGION1, that spans those two DIMMs as
-       well as DIMM2 and DIMM3.  Some of REGION1 is allocated to a PMEM namespace
-       named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for
-       each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
-       "blk5.0".
-
-    3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1
-       interleaved system-physical-address range (i.e. the DPA addresses past
-       offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
-       Note that this example shows that BLK-aperture namespaces don't need to
-       be contiguous in DPA-space.
-
-    This bus is provided by the kernel under the device
-    /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
-    the nfit_test.ko module is loaded.  This tests not only LIBNVDIMM but the
-    acpi_nfit.ko driver as well.
-
-
-LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
-========================================================
-
-What follows is a description of the LIBNVDIMM sysfs layout and a
-corresponding object hierarchy diagram as viewed through the LIBNDCTL
-API.  The example sysfs paths and diagrams are relative to the Example
-NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
-test.
-
-LIBNDCTL: Context
------------------
-
-Every API call in the LIBNDCTL library requires a context that holds the
-logging parameters and other library instance state.  The library is
-based on the libabc template:
-
-       https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git
-
-LIBNDCTL: instantiate a new library context example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-::
-
-       struct ndctl_ctx *ctx;
-
-       if (ndctl_new(&ctx) == 0)
-               return ctx;
-       else
-               return NULL;
-
-LIBNVDIMM/LIBNDCTL: Bus
------------------------
-
-A bus has a 1:1 relationship with an NFIT.  The current expectation for
-ACPI based systems is that there is only ever one platform-global NFIT.
-That said, it is trivial to register multiple NFITs; the specification
-does not preclude it.  The infrastructure supports multiple busses and
-we use this capability to test multiple NFIT configurations in the unit
-test.
-
-LIBNVDIMM: control class device in /sys/class
----------------------------------------------
-
-This character device accepts DSM messages to be passed to a DIMM
-identified by its NFIT handle::
-
-       /sys/class/nd/ndctl0
-       |-- dev
-       |-- device -> ../../../ndbus0
-       |-- subsystem -> ../../../../../../../class/nd
-
-
-
-LIBNVDIMM: bus
---------------
-
-::
-
-       struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
-              struct nvdimm_bus_descriptor *nfit_desc);
-
-::
-
-       /sys/devices/platform/nfit_test.0/ndbus0
-       |-- commands
-       |-- nd
-       |-- nfit
-       |-- nmem0
-       |-- nmem1
-       |-- nmem2
-       |-- nmem3
-       |-- power
-       |-- provider
-       |-- region0
-       |-- region1
-       |-- region2
-       |-- region3
-       |-- region4
-       |-- region5
-       |-- uevent
-       `-- wait_probe
-
-LIBNDCTL: bus enumeration example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Find the bus handle that describes the bus from Example NVDIMM Platform::
-
-       static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
-                       const char *provider)
-       {
-               struct ndctl_bus *bus;
-
-               ndctl_bus_foreach(ctx, bus)
-                       if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
-                               return bus;
-
-               return NULL;
-       }
-
-       bus = get_bus_by_provider(ctx, "nfit_test.0");
-
-
-LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
--------------------------------
-
-The DIMM device provides a character device for sending commands to
-hardware, and it is a container for LABELs.  If the DIMM is defined by
-NFIT then an optional 'nfit' attribute sub-directory is available to add
-NFIT-specifics.
-
-Note that the kernel device name for "DIMMs" is "nmemX".  The NFIT
-describes these devices via "Memory Device to System Physical Address
-Range Mapping Structure", and there is no requirement that they actually
-be physical DIMMs, so we use a more generic name.
-
-LIBNVDIMM: DIMM (NMEM)
-^^^^^^^^^^^^^^^^^^^^^^
-
-::
-
-       struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
-                       const struct attribute_group **groups, unsigned long flags,
-                       unsigned long *dsm_mask);
-
-::
-
-       /sys/devices/platform/nfit_test.0/ndbus0
-       |-- nmem0
-       |   |-- available_slots
-       |   |-- commands
-       |   |-- dev
-       |   |-- devtype
-       |   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
-       |   |-- modalias
-       |   |-- nfit
-       |   |   |-- device
-       |   |   |-- format
-       |   |   |-- handle
-       |   |   |-- phys_id
-       |   |   |-- rev_id
-       |   |   |-- serial
-       |   |   `-- vendor
-       |   |-- state
-       |   |-- subsystem -> ../../../../../bus/nd
-       |   `-- uevent
-       |-- nmem1
-       [..]
-
-
-LIBNDCTL: DIMM enumeration example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Note, in this example we are assuming NFIT-defined DIMMs which are
-identified by an "nfit_handle", a 32-bit value where:
-
-   - Bit 3:0 DIMM number within the memory channel
-   - Bit 7:4 memory channel number
-   - Bit 11:8 memory controller ID
-   - Bit 15:12 socket ID (within scope of a Node controller if node
-     controller is present)
-   - Bit 27:16 Node Controller ID
-   - Bit 31:28 Reserved
-
-::
-
-       static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
-              unsigned int handle)
-       {
-               struct ndctl_dimm *dimm;
-
-               ndctl_dimm_foreach(bus, dimm)
-                       if (ndctl_dimm_get_handle(dimm) == handle)
-                               return dimm;
-
-               return NULL;
-       }
-
-       #define DIMM_HANDLE(n, s, i, c, d) \
-               ((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
-                | (((c) & 0xf) << 4) | ((d) & 0xf))
-
-       dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
-
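For the opposite direction, a minimal C sketch that splits an "nfit_handle" back into the bit fields listed above. The struct and function names are hypothetical and are not part of libndctl.

```c
#include <stdint.h>

/* Decoded fields of an NFIT device handle (hypothetical helper type). */
struct nfit_handle_fields {
	unsigned int node;	/* bits 27:16, node controller ID */
	unsigned int socket;	/* bits 15:12, socket ID */
	unsigned int imc;	/* bits 11:8, memory controller ID */
	unsigned int chan;	/* bits 7:4, memory channel number */
	unsigned int dimm;	/* bits 3:0, DIMM number within the channel */
};

struct nfit_handle_fields nfit_handle_decode(uint32_t handle)
{
	struct nfit_handle_fields f;

	f.node = (handle >> 16) & 0xfff;
	f.socket = (handle >> 12) & 0xf;
	f.imc = (handle >> 8) & 0xf;
	f.chan = (handle >> 4) & 0xf;
	f.dimm = handle & 0xf;
	return f;
}
```

Bits 31:28 are reserved and are ignored by this sketch.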
-LIBNVDIMM/LIBNDCTL: Region
---------------------------
-
-A generic REGION device is registered for each PMEM range or BLK-aperture
-set.  Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
-sets on the "nfit_test.0" bus.  The primary role of a region is to be a
-container of "mappings".  A mapping is a tuple of <DIMM,
-DPA-start-offset, length>.
-
-LIBNVDIMM provides a built-in driver for these REGION devices.  This driver
-is responsible for reconciling the aliased DPA mappings across all
-regions, parsing the LABEL, if present, and then emitting NAMESPACE
-devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
-nd_blk device driver to consume.
-
-In addition to the generic attributes of "mapping"s, "interleave_ways",
-and "size", the REGION device also exports some convenience attributes.
-"nstype" indicates the integer type of namespace-device this region
-emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
-'add' event, "modalias" duplicates the MODALIAS variable stored by udev
-at the 'add' event, and finally, the optional "spa_index" is provided in
-the case where the region is defined by a SPA.
-
-LIBNVDIMM: region::
-
-       struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
-                       struct nd_region_desc *ndr_desc);
-       struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
-                       struct nd_region_desc *ndr_desc);
-
-::
-
-       /sys/devices/platform/nfit_test.0/ndbus0
-       |-- region0
-       |   |-- available_size
-       |   |-- btt0
-       |   |-- btt_seed
-       |   |-- devtype
-       |   |-- driver -> ../../../../../bus/nd/drivers/nd_region
-       |   |-- init_namespaces
-       |   |-- mapping0
-       |   |-- mapping1
-       |   |-- mappings
-       |   |-- modalias
-       |   |-- namespace0.0
-       |   |-- namespace_seed
-       |   |-- numa_node
-       |   |-- nfit
-       |   |   `-- spa_index
-       |   |-- nstype
-       |   |-- set_cookie
-       |   |-- size
-       |   |-- subsystem -> ../../../../../bus/nd
-       |   `-- uevent
-       |-- region1
-       [..]
-
-LIBNDCTL: region enumeration example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Sample region retrieval routines based on NFIT-unique data like
-"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
-BLK::
-
-       static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
-                       unsigned int spa_index)
-       {
-               struct ndctl_region *region;
-
-               ndctl_region_foreach(bus, region) {
-                       if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
-                               continue;
-                       if (ndctl_region_get_spa_index(region) == spa_index)
-                               return region;
-               }
-               return NULL;
-       }
-
-       static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
-                       unsigned int handle)
-       {
-               struct ndctl_region *region;
-
-               ndctl_region_foreach(bus, region) {
-                       struct ndctl_mapping *map;
-
-                       if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
-                               continue;
-                       ndctl_mapping_foreach(region, map) {
-                               struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
-
-                               if (ndctl_dimm_get_handle(dimm) == handle)
-                                       return region;
-                       }
-               }
-               return NULL;
-       }
-
-
-Why Not Encode the Region Type into the Region Name?
-----------------------------------------------------
-
-At first glance it seems that, since NFIT defines just PMEM and BLK interface
-types, we should simply name REGION devices with something derived
-from those type names.  However, the ND subsystem explicitly keeps the
-REGION name generic and expects userspace to always consider the
-region-attributes for four reasons:
-
-    1. There are already more than two REGION and "namespace" types.  For
-       PMEM there are two subtypes.  As mentioned previously we have PMEM where
-       the constituent DIMM devices are known and anonymous PMEM.  For BLK
-       regions the NFIT specification already anticipates vendor specific
-       implementations.  The exact distinction of what a region contains is in
-       the region-attributes not the region-name or the region-devtype.
-
-    2. A region with zero child-namespaces is a possible configuration.  For
-       example, the NFIT allows for a DCR to be published without a
-       corresponding BLK-aperture.  This equates to a DIMM that can only accept
-       control/configuration messages, but no i/o through a descendant block
-       device.  Again, this "type" is advertised in the attributes ('mappings'
-       == 0) and the name does not tell you much.
-
-    3. What if a third major interface type arises in the future?  Outside
-       of vendor specific implementations, it's not difficult to envision a
-       third class of interface type beyond BLK and PMEM.  With a generic name
-       for the REGION level of the device-hierarchy old userspace
-       implementations can still make sense of new kernel advertised
-       region-types.  Userspace can always rely on the generic region
-       attributes like "mappings", "size", etc and the expected child devices
-       named "namespace".  This generic format of the device-model hierarchy
-       allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
-       future-proof.
-
-    4. There are more robust mechanisms for determining the major type of a
-       region than a device name.  See the next section, How Do I Determine the
-       Major Type of a Region?
-
-How Do I Determine the Major Type of a Region?
-----------------------------------------------
-
-Outside of the blanket recommendation of "use libndctl", or simply
-looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
-"nstype" integer attribute, here are some other options.
-
-1. module alias lookup
-^^^^^^^^^^^^^^^^^^^^^^
-
-    The whole point of region/namespace device type differentiation is to
-    decide which block-device driver will attach to a given LIBNVDIMM namespace.
-    One can simply use the modalias to look up the resulting module.  It's
-    important to note that this method is robust in the presence of a
-    vendor-specific driver down the road.  If a vendor-specific
-    implementation wants to supplant the standard nd_blk driver it can with
-    minimal impact to the rest of LIBNVDIMM.
-
-    In fact, a vendor may also want to have a vendor-specific region-driver
-    (outside of nd_region).  For example, if a vendor defined its own LABEL
-    format it would need its own region driver to parse that LABEL and emit
-    the resulting namespaces.  The output from module resolution is more
-    accurate than a region-name or region-devtype.
-
-2. udev
-^^^^^^^
-
-    The kernel "devtype" is registered in the udev database::
-
-       # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
-       P: /devices/platform/nfit_test.0/ndbus0/region0
-       E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
-       E: DEVTYPE=nd_pmem
-       E: MODALIAS=nd:t2
-       E: SUBSYSTEM=nd
-
-       # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
-       P: /devices/platform/nfit_test.0/ndbus0/region4
-       E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
-       E: DEVTYPE=nd_blk
-       E: MODALIAS=nd:t3
-       E: SUBSYSTEM=nd
-
-    ...and is available as a region attribute, but keep in mind that the
-    "devtype" does not indicate sub-type variations and scripts should
-    really consult the other attributes.
-
-3. type specific attributes
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-    As it currently stands a BLK-aperture region will never have a
-    "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region.  A
-    BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
-    that does not allow I/O.  A PMEM region with a "mappings" value of zero
-    is a simple system-physical-address range.
-
-
-LIBNVDIMM/LIBNDCTL: Namespace
------------------------------
-
-A REGION, after resolving DPA aliasing and LABEL specified boundaries,
-surfaces one or more "namespace" devices.  The arrival of a "namespace"
-device currently triggers either the nd_blk or nd_pmem driver to load
-and register a disk/block device.
-
-LIBNVDIMM: namespace
-^^^^^^^^^^^^^^^^^^^^
-
-Here is a sample layout from the three major types of NAMESPACE where
-namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
-attribute), namespace2.0 represents a BLK namespace (note that it has a
-'sector_size' attribute), and namespace6.0 represents an anonymous
-PMEM namespace (note that it has no 'uuid' attribute because it does not
-support a LABEL)::
-
-       /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
-       |-- alt_name
-       |-- devtype
-       |-- dpa_extents
-       |-- force_raw
-       |-- modalias
-       |-- numa_node
-       |-- resource
-       |-- size
-       |-- subsystem -> ../../../../../../bus/nd
-       |-- type
-       |-- uevent
-       `-- uuid
-       /sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
-       |-- alt_name
-       |-- devtype
-       |-- dpa_extents
-       |-- force_raw
-       |-- modalias
-       |-- numa_node
-       |-- sector_size
-       |-- size
-       |-- subsystem -> ../../../../../../bus/nd
-       |-- type
-       |-- uevent
-       `-- uuid
-       /sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
-       |-- block
-       |   `-- pmem0
-       |-- devtype
-       |-- driver -> ../../../../../../bus/nd/drivers/pmem
-       |-- force_raw
-       |-- modalias
-       |-- numa_node
-       |-- resource
-       |-- size
-       |-- subsystem -> ../../../../../../bus/nd
-       |-- type
-       `-- uevent
-
-LIBNDCTL: namespace enumeration example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Namespaces are indexed relative to their parent region, as in the example
-below.  These indexes are mostly static from boot to boot, but the subsystem
-makes no guarantees in this regard.  For a static namespace identifier use its
-'uuid' attribute.
-
-::
-
-  static struct ndctl_namespace
-  *get_namespace_by_id(struct ndctl_region *region, unsigned int id)
-  {
-          struct ndctl_namespace *ndns;
-
-          ndctl_namespace_foreach(region, ndns)
-                  if (ndctl_namespace_get_id(ndns) == id)
-                          return ndns;
-
-          return NULL;
-  }
-
-LIBNDCTL: namespace creation example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Idle namespaces are automatically created by the kernel if a given
-region has enough available capacity to create a new namespace.
-Namespace instantiation involves finding an idle namespace and
-configuring it.  For the most part the setting of namespace attributes
-can occur in any order; the only constraint is that 'uuid' must be set
-before 'size'.  This enables the kernel to track DPA allocations
-internally with a static identifier::
-
-  static int configure_namespace(struct ndctl_region *region,
-                  struct ndctl_namespace *ndns,
-                  struct namespace_parameters *parameters)
-  {
-          char devname[50];
-
-          snprintf(devname, sizeof(devname), "namespace%d.%d",
-                          ndctl_region_get_id(region), parameters->id);
-
-          ndctl_namespace_set_alt_name(ndns, devname);
-          /* 'uuid' must be set prior to setting size! */
-          ndctl_namespace_set_uuid(ndns, parameters->uuid);
-          ndctl_namespace_set_size(ndns, parameters->size);
-          /* unlike pmem namespaces, blk namespaces have a sector size */
-          if (parameters->lbasize)
-                  ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
-          return ndctl_namespace_enable(ndns);
-  }
-
-
-Why the Term "namespace"?
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-    1. Why not "volume" for instance?  "volume" ran the risk of confusing
-       ND (libnvdimm subsystem) with a volume manager like device-mapper.
-
-    2. The term originated to describe the sub-devices that can be created
-       within an NVME controller (see the nvme specification:
-       http://www.nvmexpress.org/specifications/), and NFIT namespaces are
-       meant to parallel the capabilities and configurability of
-       NVME-namespaces.
-
-
-LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
--------------------------------------------------
-
-A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
-block device driver that fronts either the whole block device or a
-partition of a block device emitted by either a PMEM or BLK NAMESPACE.
-
-LIBNVDIMM: btt layout
-^^^^^^^^^^^^^^^^^^^^^
-
-Every region will start out with at least one BTT device which is the
-seed device.  To activate it set the "namespace", "uuid", and
-"sector_size" attributes and then bind the device to the nd_pmem or
-nd_blk driver depending on the region type::
-
-       /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
-       |-- namespace
-       |-- delete
-       |-- devtype
-       |-- modalias
-       |-- numa_node
-       |-- sector_size
-       |-- subsystem -> ../../../../../bus/nd
-       |-- uevent
-       `-- uuid
-
-LIBNDCTL: btt creation example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Similar to namespaces an idle BTT device is automatically created per
-region.  Each time this "seed" btt device is configured and enabled a new
-seed is created.  Creating a BTT configuration involves two steps:
-finding an idle BTT and assigning it to consume a PMEM or BLK namespace::
-
-       static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
-       {
-               struct ndctl_btt *btt;
-
-               ndctl_btt_foreach(region, btt)
-                       if (!ndctl_btt_is_enabled(btt)
-                                       && !ndctl_btt_is_configured(btt))
-                               return btt;
-
-               return NULL;
-       }
-
-       static int configure_btt(struct ndctl_region *region,
-                       struct btt_parameters *parameters)
-       {
-               struct ndctl_btt *btt = get_idle_btt(region);
-
-               ndctl_btt_set_uuid(btt, parameters->uuid);
-               ndctl_btt_set_sector_size(btt, parameters->sector_size);
-               ndctl_btt_set_namespace(btt, parameters->ndns);
-               /* turn off raw mode device */
-               ndctl_namespace_disable(parameters->ndns);
-               /* turn on btt access */
-               return ndctl_btt_enable(btt);
-       }
-
-Once instantiated a new inactive btt seed device will appear underneath
-the region.
-
-Once a "namespace" is removed from a BTT that instance of the BTT device
-will be deleted or otherwise reset to default values.  This deletion is
-only at the device model level.  In order to destroy a BTT the "info
-block" needs to be destroyed.  Note that to destroy a BTT the media
-needs to be written in raw mode.  By default, the kernel will autodetect
-the presence of a BTT and disable raw mode.  This autodetect behavior
-can be suppressed by enabling raw mode for the namespace via the
-ndctl_namespace_set_raw_mode() API.
-
-
-Summary LIBNDCTL Diagram
-------------------------
-
-For the given example above, here is the view of the objects as seen by the
-LIBNDCTL API::
-
-              +---+
-              |CTX|    +---------+   +--------------+  +---------------+
-              +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
-                |    | +---------+   +--------------+  +---------------+
-  +-------+     |    | +---------+   +--------------+  +---------------+
-  | DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
-  +-------+ |   |    | +---------+   +--------------+  +---------------+
-  | DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
-  +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" |
-  | DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+
-  +-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 |
-  | DIMM3 <-+        |               +--------------+  +----------------------+
-  +-------+          | +---------+   +--------------+  +---------------+
-                     +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" |
-                     | +---------+ | +--------------+  +----------------------+
-                     |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 |
-                     |               +--------------+  +----------------------+
-                     | +---------+   +--------------+  +---------------+
-                     +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" |
-                     | +---------+   +--------------+  +---------------+
-                     | +---------+   +--------------+  +----------------------+
-                     +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 |
-                       +---------+   +--------------+  +---------------+------+
diff --git a/Documentation/nvdimm/security.rst b/Documentation/nvdimm/security.rst
deleted file mode 100644
index ad9dea0..0000000
--- a/Documentation/nvdimm/security.rst
+++ /dev/null
@@ -1,143 +0,0 @@
-===============
-NVDIMM Security
-===============
-
-1. Introduction
----------------
-
-With the introduction of Intel Device Specific Methods (DSM) v1.8
-specification [1], security DSMs are introduced. The spec added the following
-security DSMs: "get security state", "set passphrase", "disable passphrase",
-"unlock unit", "freeze lock", "secure erase", and "overwrite". A security_ops
-data structure has been added to struct dimm in order to support the security
-operations and generic APIs are exposed to allow vendor neutral operations.
-
-2. Sysfs Interface
-------------------
-The "security" sysfs attribute is provided in the nvdimm sysfs directory. For
-example:
-/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
-
-The "show" operation of that attribute will display the security state for
-that DIMM. The following states are available: disabled, unlocked, locked,
-frozen, and overwrite. If security is not supported, the sysfs attribute
-will not be visible.
-
-The "store" operation takes several commands when it is being written to
-in order to support some of the security functionalities:
-update <old_keyid> <new_keyid> - enable or update passphrase.
-disable <keyid> - disable enabled security and remove key.
-freeze - freeze changing of security states.
-erase <keyid> - delete existing user encryption key.
-overwrite <keyid> - wipe the entire nvdimm.
-master_update <keyid> <new_keyid> - enable or update master passphrase.
-master_erase <keyid> - delete existing user encryption key.
-
-3. Key Management
------------------
-
-The key is associated to the payload by the DIMM id. For example:
-# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/nfit/id
-8089-a2-1740-00000133
-The DIMM id would be provided along with the key payload (passphrase) to
-the kernel.
-
-The security keys are managed on the basis of a single key per DIMM. The
-key "passphrase" is expected to be 32 bytes long. This is similar to the ATA
-security specification [2]. A key is initially acquired via the request_key()
-kernel API call during nvdimm unlock. It is up to the user to make sure that
-all the keys are in the kernel user keyring for unlock.
-
-An nvdimm encrypted-key of format enc32 has the description format of:
-nvdimm:<bus-provider-specific-unique-id>
-
-See file ``Documentation/security/keys/trusted-encrypted.rst`` for creating
-encrypted-keys of enc32 format. TPM usage with a master trusted key is
-preferred for sealing the encrypted-keys.
-
-4. Unlocking
-------------
-When the DIMMs are being enumerated by the kernel, the kernel will attempt to
-retrieve the key from the kernel user keyring. This is the only time
-a locked DIMM can be unlocked. Once unlocked, the DIMM will remain unlocked
-until reboot. Typically an entity (e.g. a shell script) will inject all the
-relevant encrypted-keys into the kernel user keyring during the initramfs phase.
-This provides the unlock function access to all the related keys that contain
-the passphrase for the respective nvdimms.  It is also recommended that the
-keys are injected before libnvdimm is loaded by modprobe.
-
-5. Update
----------
-When doing an update, it is expected that the existing key is removed from
-the kernel user keyring and reinjected as a different (old) key. It's irrelevant
-what the key description is for the old key since we are only interested in the
-keyid when doing the update operation. It is also expected that the new key
-is injected with the description format described earlier in this
-document.  The update command written to the sysfs attribute has
-the format:
-update <old keyid> <new keyid>
-
-If there is no old keyid due to a security enabling, then a 0 should be
-passed in.
-
-6. Freeze
----------
-The freeze operation does not require any keys. The security config can be
-frozen by a user with root privilege.
-
-7. Disable
-----------
-The security disable command format is:
-disable <keyid>
-
-A key with the current passphrase payload that is tied to the nvdimm should be
-in the kernel user keyring.
-
-8. Secure Erase
----------------
-The command format for doing a secure erase is:
-erase <keyid>
-
-A key with the current passphrase payload that is tied to the nvdimm should be
-in the kernel user keyring.
-
-9. Overwrite
-------------
-The command format for doing an overwrite is:
-overwrite <keyid>
-
-Overwrite can be done without a key if security is not enabled. A key serial
-of 0 can be passed in to indicate no key.
-
-The sysfs attribute "security" can be polled to wait on overwrite completion.
-Overwrite can last tens of minutes or more depending on nvdimm size.
-
-An encrypted-key with the current user passphrase that is tied to the nvdimm
-should be injected and its keyid should be passed in via sysfs.
-
-10. Master Update
------------------
-The command format for doing a master update is:
-master_update <old keyid> <new keyid>
-
-The operating mechanism for master update is identical to update except the
-master passphrase key is passed to the kernel. The master passphrase key
-is just another encrypted-key.
-
-This command is only available when security is disabled.
-
-11. Master Erase
-----------------
-The command format for doing a master erase is:
-master_erase <current keyid>
-
-This command has the same operating mechanism as erase except the master
-passphrase key is passed to the kernel. The master passphrase key is just
-another encrypted-key.
-
-This command is only available when the master security is enabled, indicated
-by the extended security status.
-
-[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
-
-[2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index e89c1c332407a79defc9e5f32367f26d6c16cae2..a5fde15e91d397ec4fe81796f88df75abdd4b0ca 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -33,7 +33,7 @@ config BLK_DEV_PMEM
           Documentation/admin-guide/kernel-parameters.rst).  This driver converts
           these persistent memory ranges into block devices that are
           capable of DAX (direct-access) file system mappings.  See
-         Documentation/nvdimm/nvdimm.rst for more details.
+         Documentation/driver-api/nvdimm/nvdimm.rst for more details.
  
           Say Y if you want to use an NVDIMM
Documentation/driver-api/index.rst
Documentation/driver-api/nvdimm/btt.rst [new file with mode: 0644]
Documentation/driver-api/nvdimm/index.rst [new file with mode: 0644]
Documentation/driver-api/nvdimm/nvdimm.rst [new file with mode: 0644]
Documentation/driver-api/nvdimm/security.rst [new file with mode: 0644]
Documentation/nvdimm/btt.rst [deleted file]
Documentation/nvdimm/index.rst [deleted file]
Documentation/nvdimm/nvdimm.rst [deleted file]
Documentation/nvdimm/security.rst [deleted file]
drivers/nvdimm/Kconfig