# Flash Translation Layer {#ftl}

The Flash Translation Layer library provides block device access on top of devices
implementing the bdev_zone interface.
It handles the logical to physical address mapping, responds to asynchronous
media management events, and manages the defragmentation process.

# Terminology {#ftl_terminology}

## Logical to physical address map

* Shorthand: L2P

Contains the mapping of the logical addresses (LBA) to their on-disk physical location. The LBAs
are contiguous and in range from 0 to the number of surfaced blocks (the number of spare blocks
is calculated during device formation and is subtracted from the available address space). The
spare blocks account for zones going offline throughout the lifespan of the device as well as
provide the necessary buffer for data [defragmentation](#ftl_reloc).

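Conceptually, the L2P can be thought of as a flat array indexed by LBA. The sketch below only
illustrates that idea; the structure and function names are assumptions made for this example and
are not part of the actual libftl code:

```
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical physical address: a band and a block offset within it. */
struct ftl_ppa {
	uint32_t band;
	uint32_t offset;
};

/* Flat L2P table: one entry per surfaced LBA. */
struct ftl_l2p {
	uint64_t	num_lbas;	/* surfaced blocks only; spare blocks are excluded */
	struct ftl_ppa	*map;		/* map[lba] -> current physical location */
};

static struct ftl_l2p *ftl_l2p_init(uint64_t num_lbas)
{
	struct ftl_l2p *l2p = calloc(1, sizeof(*l2p));

	if (l2p == NULL) {
		return NULL;
	}

	l2p->num_lbas = num_lbas;
	l2p->map = calloc(num_lbas, sizeof(*l2p->map));
	if (l2p->map == NULL) {
		free(l2p);
		return NULL;
	}

	return l2p;
}

/* A write to an LBA simply points the entry at the new location; the block
 * previously holding that LBA becomes invalid and is reclaimed by defrag. */
static void ftl_l2p_set(struct ftl_l2p *l2p, uint64_t lba, struct ftl_ppa ppa)
{
	l2p->map[lba] = ppa;
}
```
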
## Band {#ftl_band}

A band describes a collection of zones, each belonging to a different parallel unit. All writes to
a band follow the same pattern - a batch of logical blocks is written to one zone, another batch
to the next one and so on. This ensures the parallelism of the write operations, as they can be
executed independently on different zones. Each band keeps track of the LBAs it consists of, as
well as their validity, as some of the data will be invalidated by subsequent writes to the same
logical address. The L2P mapping can be restored from the SSD by reading this information in order
from the oldest band to the youngest.


                +--------------+        +--------------+                        +--------------+
    band 1  |    zone 1    +--------+    zone 1    +---- --- --- --- --- ---+    zone 1    |
                +--------------+        +--------------+                        +--------------+
    band 2  |    zone 2    +--------+    zone 2    +---- --- --- --- --- ---+    zone 2    |
                +--------------+        +--------------+                        +--------------+
    band 3  |    zone 3    +--------+    zone 3    +---- --- --- --- --- ---+    zone 3    |
                +--------------+        +--------------+                        +--------------+
                |     ...      |        |     ...      |                        |     ...      |
                +--------------+        +--------------+                        +--------------+
    band m  |    zone m    +--------+    zone m    +---- --- --- --- --- ---+    zone m    |
                +--------------+        +--------------+                        +--------------+
                |     ...      |        |     ...      |                        |     ...      |
                +--------------+        +--------------+                        +--------------+

    parallel unit 1                          pu 2                                     pu n

The address map and valid map are, along with several other things (e.g. UUID of the device it's
part of, number of surfaced LBAs, band's sequence number, etc.), parts of the band's metadata. The
metadata is split in two parts:

         head metadata              band's data               tail metadata
    +-------------------+-------------------------------+------------------------+
    |zone 1 |...|zone n |...|...|zone 1 |...|           | ... |zone m-1 |zone m  |
    |block 1|   |block 1|   |   |block x|   |           |     |block y  |block y |
    +-------------------+-------------------------------+------------------------+

* the head part, containing information already known when opening the band (device's UUID, band's
  sequence number, etc.), located in the first blocks of the band,
* the tail part, containing the address map and the valid map, located at the end of the band.

Bands are written sequentially (in the way described earlier). Before a band can be written
to, all of its zones need to be erased. During that time, the band is considered to be in a `PREP`
state. After that is done, the band transitions to the `OPENING` state, in which the head metadata
is written. Then the band moves to the `OPEN` state and actual user data can be written to the
band. Once the whole available space is filled, the tail metadata is written and the band transitions
to the `CLOSING` state. When that finishes, the band becomes `CLOSED`.

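The band life cycle described above can be summarized as a small state machine. The following is an
illustrative sketch only; the enum and helper are hypothetical and merely mirror the states named in
this section:

```
/* Hypothetical band states mirroring the description above. */
enum band_state {
	BAND_STATE_PREP,	/* zones are being erased */
	BAND_STATE_OPENING,	/* head metadata is being written */
	BAND_STATE_OPEN,	/* user data is being written */
	BAND_STATE_CLOSING,	/* tail metadata (address and valid maps) is being written */
	BAND_STATE_CLOSED,	/* fully written; may later be picked for defrag */
};

/* Bands only move forward through their life cycle. */
static enum band_state band_next_state(enum band_state state)
{
	switch (state) {
	case BAND_STATE_PREP:		return BAND_STATE_OPENING;
	case BAND_STATE_OPENING:	return BAND_STATE_OPEN;
	case BAND_STATE_OPEN:		return BAND_STATE_CLOSING;
	case BAND_STATE_CLOSING:	return BAND_STATE_CLOSED;
	default:			return BAND_STATE_CLOSED;
	}
}
```
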
## Ring write buffer {#ftl_rwb}

* Shorthand: RWB

Because the smallest write size the SSD may support can be a multiple of block size, in order to
support writes to a single block, the data needs to be buffered. The write buffer is the solution to
this problem. It consists of a number of pre-allocated buffers called batches, each of a size that
allows for a single transfer to the SSD. A single batch is divided into block-sized buffer entries.

                 write buffer
    +-----------------------------------+
    |batch 1                            |
    |   +-----------------------------+ |
    |   |rwb    |rwb    | ... |rwb    | |
    |   |entry 1|entry 2|     |entry n| |
    |   +-----------------------------+ |
    +-----------------------------------+
    |                ...                |
    +-----------------------------------+
    |batch m                            |
    |   +-----------------------------+ |
    |   |rwb    |rwb    | ... |rwb    | |
    |   |entry 1|entry 2|     |entry n| |
    |   +-----------------------------+ |
    +-----------------------------------+

When a write is scheduled, it needs to acquire an entry for each of its blocks and copy the data
into the buffer. Once all blocks are copied, the write can be signalled as completed to the user.
In the meantime, the `rwb` is polled for filled batches and, if one is found, it's sent to the SSD.
After that operation is completed, the whole batch can be freed. For the whole time the data is in
the `rwb`, the L2P points at the buffer entry instead of a location on the SSD. This allows for
servicing read requests from the buffer.

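A minimal sketch of the layout described above, with hypothetical structure names (not the actual
`rwb` implementation), showing how a filled batch could be picked up for submission to the SSD:

```
#include <stdbool.h>
#include <stdint.h>

/* One block-sized buffer entry. While the data sits here, the L2P points at
 * this entry instead of an SSD location, so reads can be served from it. */
struct rwb_entry {
	uint64_t lba;
	void	*block;		/* block-sized payload */
	bool	 filled;
};

/* One batch: sized for a single transfer to the SSD. */
struct rwb_batch {
	struct rwb_entry *entries;
	uint32_t	  num_entries;
	uint32_t	  num_filled;
};

struct rwb {
	struct rwb_batch *batches;
	uint32_t	  num_batches;
};

/* Polled by the FTL: return a batch only once all of its entries are filled,
 * so it can be written to the SSD in one operation and then freed. */
static struct rwb_batch *rwb_pop_filled_batch(struct rwb *rwb)
{
	uint32_t i;

	for (i = 0; i < rwb->num_batches; i++) {
		struct rwb_batch *batch = &rwb->batches[i];

		if (batch->num_entries > 0 && batch->num_filled == batch->num_entries) {
			return batch;
		}
	}

	return NULL;
}
```
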
## Defragmentation and relocation {#ftl_reloc}

* Shorthand: defrag, reloc

Since a write to the same LBA invalidates its previous physical location, some of the blocks on a
band might contain old data that only wastes space. As there is no way to overwrite an already
written block, this data will stay there until the whole zone is reset. This might create a
situation in which all of the bands contain some valid data and no band can be erased, so no writes
can be executed anymore. Therefore a mechanism is needed to move valid data and invalidate whole
bands, so that they can be reused.

    band                                             band
    +-----------------------------------+            +-----------------------------------+
    |** *    * ***      *    *** * *    |            |                                   |
    |**  *       *    *    * *     *   *|   +---->   |                                   |
    |*     *  * ***     *     *    *    |            |                                   |
    +-----------------------------------+            +-----------------------------------+

Valid blocks are marked with an asterisk '\*'.

Another reason for data relocation might be an event from the SSD telling us that the data might
become corrupt if it's not relocated. This might happen due to its old age (if it was written a
long time ago) or due to read disturb (a media characteristic that causes corruption of neighbouring
blocks during a read operation).

The module responsible for data relocation is called `reloc`. When a band is chosen for defragmentation
or a media management event is received, the appropriate blocks are marked as needing to be moved.
The `reloc` module takes a band that has some of these blocks marked, checks their validity and,
if they're still valid, copies them.

Choosing a band for defragmentation depends on several factors: its valid ratio (1) (proportion of
valid blocks to all user blocks), its age (2) (when it was written) and its write count / wear level
index of its zones (3) (how many times the band was written to). The lower the ratio (1), the
higher its age (2) and the lower its write count (3), the higher the chance the band will be chosen
for defrag.

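As an illustration of how these three factors could be weighed against each other, here is a hedged
sketch of a scoring heuristic; the weights and the formula are made up for this example and do not
reflect the actual libftl policy:

```
#include <stdint.h>

/* Per-band statistics relevant to the defrag decision. */
struct band_stats {
	double		valid_ratio;	/* (1) valid blocks / all user blocks, in [0.0, 1.0] */
	uint64_t	seq_id;		/* (2) sequence number; lower means written earlier */
	uint64_t	write_count;	/* (3) wear level indicator of the band's zones */
};

/* Higher score means a better defrag candidate: little valid data to move,
 * old contents, and zones that are not yet worn out. The weights are arbitrary
 * and current_seq / max_write_count are assumed to be non-zero. */
static double defrag_score(const struct band_stats *band, uint64_t current_seq,
			   uint64_t max_write_count)
{
	double invalidity = 1.0 - band->valid_ratio;
	double age = (double)(current_seq - band->seq_id) / (double)current_seq;
	double wear = 1.0 - (double)band->write_count / (double)max_write_count;

	return 0.5 * invalidity + 0.3 * age + 0.2 * wear;
}
```
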
# Usage {#ftl_usage}

## Prerequisites {#ftl_prereq}

In order to use the FTL module, a device exposing a zoned interface is required, e.g. a `zone_block`
bdev or an OCSSD `nvme` bdev.

## FTL bdev creation {#ftl_create}

Similar to other bdevs, FTL bdevs can be created either based on JSON config files or via RPC.
Both interfaces require the same arguments, which are described by the `--help` option of the
`bdev_ftl_create` RPC call:

- bdev's name
- base bdev's name (the base bdev must implement the bdev_zone API)
- UUID of the FTL device (if the FTL is to be restored from the SSD)

## FTL usage with OCSSD nvme bdev {#ftl_ocssd}

This option requires an Open Channel SSD, which can be emulated using QEMU.

QEMU with the patches providing Open Channel support can be found in SPDK's QEMU fork
on the [spdk-3.0.0](https://github.com/spdk/qemu/tree/spdk-3.0.0) branch.

## Configuring QEMU {#ftl_qemu_config}

To emulate an Open Channel device, QEMU expects parameters describing the characteristics and
geometry of the SSD:

- `serial` - serial number,
- `lver` - version of the OCSSD standard (0 - disabled, 1 - "1.2", 2 - "2.0"); libftl only supports
  2.0,
- `lba_index` - default LBA format; possible values can be found in the table below (libftl only
  supports lba_index >= 3),
- `lnum_ch` - number of groups,
- `lnum_lun` - number of parallel units,
- `lnum_pln` - number of planes (logical blocks from all planes constitute a chunk),
- `lpgs_per_blk` - number of pages (smallest programmable unit) per chunk,
- `lsecs_per_pg` - number of sectors in a page,
- `lblks_per_pln` - number of chunks in a parallel unit,
- `laer_thread_sleep` - timeout in ms between asynchronous events requesting the host to relocate
  the data based on media feedback,
- `lmetadata` - metadata file.

|lba_index| data| metadata|
|---------|-----|---------|
| 0       | 512B| 0B      |
| 1       | 512B| 8B      |
| 2       | 512B| 16B     |
| 3       |4096B| 0B      |
| 4       |4096B| 64B     |
| 5       |4096B| 128B    |
| 6       |4096B| 16B     |

For a more detailed description of the available options, consult the `hw/block/nvme.c` file in
the QEMU repository.

Example:

```
$ /path/to/qemu [OTHER PARAMETERS] -drive format=raw,file=/path/to/data/file,if=none,id=myocssd0
	-device nvme,drive=myocssd0,serial=deadbeef,lver=2,lba_index=3,lnum_ch=1,lnum_lun=8,lnum_pln=4,
	lpgs_per_blk=1536,lsecs_per_pg=4,lblks_per_pln=512,lmetadata=/path/to/md/file
```

In the above example, a device is created with 1 channel, 8 parallel units, 512 chunks per parallel
unit, 24576 (`lnum_pln` * `lpgs_per_blk` * `lsecs_per_pg`) logical blocks in each chunk, with a
logical block size of 4096B. Therefore the data file needs to be at least 384G (8 * 512 * 24576 * 4096B)
in size and can be created with the following command:

```
fallocate -l 384G /path/to/data/file
```

## Configuring SPDK {#ftl_spdk_config}

To verify that the drive is emulated correctly, one can check the output of the NVMe identify app
(assuming that `scripts/setup.sh` was called beforehand and the driver has been changed for that
device):

```
$ build/examples/identify
=====================================================
NVMe Controller at 0000:00:0a.0 [1d1d:1f1f]
=====================================================
Controller Capabilities/Features
================================
Vendor ID: 1d1d
Subsystem Vendor ID: 1af4
Serial Number: deadbeef
Model Number: QEMU NVMe Ctrl

... other info ...

Namespace OCSSD Geometry
=======================
OC version: maj:2 min:0

... other info ...

Groups (channels): 1
PUs (LUNs) per group: 8
Chunks per LUN: 512
Logical blks per chunk: 24576

... other info ...

```

In order to create FTL on top of an Open Channel SSD, the following steps are required:

1) Attach the OCSSD NVMe controller
2) Create an OCSSD bdev on the controller attached in step 1 (the user can specify a parallel unit
   range and create multiple OCSSD bdevs on a single OCSSD NVMe controller)
3) Create an FTL bdev on top of the bdev created in step 2

Example:
```
$ scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -a 00:0a.0 -t pcie

$ scripts/rpc.py bdev_ocssd_create -c nvme0 -b nvme0n1
nvme0n1

$ scripts/rpc.py bdev_ftl_create -b ftl0 -d nvme0n1
{
	"name": "ftl0",
	"uuid": "3b469565-1fa5-4bfb-8341-747ec9fca9b9"
}
```

## FTL usage with zone block bdev {#ftl_zone_block}

The zone block bdev is an adapter between a regular `bdev` and `bdev_zone`. It emulates a zoned
interface on top of a regular block device.

In order to create FTL on top of a regular bdev:

1) Create a regular bdev, e.g. `bdev_nvme`, `bdev_null`, `bdev_malloc`
2) Create a zone block bdev on top of the regular bdev created in step 1 (the user can specify the
   zone capacity and the optimal number of open zones)
3) Create an FTL bdev on top of the bdev created in step 2

Example:
```
$ scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -a 00:05.0 -t pcie
nvme0n1

$ scripts/rpc.py bdev_zone_block_create -b zone1 -n nvme0n1 -z 4096 -o 32
zone1

$ scripts/rpc.py bdev_ftl_create -b ftl0 -d zone1
{
	"name": "ftl0",
	"uuid": "3b469565-1fa5-4bfb-8341-747ec9f3a9b9"
}
```