Commit | Line | Data |
---|---|---|
3b1a94c8 DLM |
1 | dm-zoned |
2 | ======== | |
3 | ||
4 | The dm-zoned device mapper target exposes a zoned block device (ZBC and | |
5 | ZAC compliant devices) as a regular block device without any write | |
6 | pattern constraints. In effect, it implements a drive-managed zoned | |
7 | block device which hides from the user (a file system or an application | |
8 | doing raw block device accesses) the sequential write constraints of | |
9 | host-managed zoned block devices and can mitigate the potential | |
10 | device-side performance degradation due to excessive random writes on | |
11 | host-aware zoned block devices. | |
12 | ||
13 | For a more detailed description of the zoned block device models and | |
14 | their constraints see (for SCSI devices): | |
15 | ||
16 | http://www.t10.org/drafts.htm#ZBC_Family | |
17 | ||
18 | and (for ATA devices): | |
19 | ||
20 | http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf | |
21 | ||
22 | The dm-zoned implementation is simple and minimizes system overhead (CPU | |
23 | and memory usage as well as storage capacity loss). For a 10TB | |
24 | host-managed disk with 256 MB zones, dm-zoned memory usage per disk | |
25 | instance is at most 4.5 MB and as little as 5 zones will be used | |
26 | internally for storing metadata and performing reclaim operations. | |
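As a rough check of the storage capacity loss quoted above, the following sketch counts the zones on such a disk and the share taken by 5 internal zones. It assumes that 10TB means 10 * 10^12 bytes and that 256 MB means 256 MiB; neither interpretation is stated in this document.

```python
# Back-of-the-envelope capacity overhead; the unit interpretations are assumptions.
DISK_BYTES = 10 * 10**12      # assumed: decimal terabytes
ZONE_BYTES = 256 * 1024**2    # assumed: binary megabytes (MiB)
INTERNAL_ZONES = 5            # "as little as 5 zones" from the text

total_zones = DISK_BYTES // ZONE_BYTES      # whole zones on the disk
overhead = INTERNAL_ZONES / total_zones     # fraction of zones used internally

print(f"{total_zones} zones, {overhead:.4%} used internally")
```

Under these assumptions the internal zones amount to well under a tenth of a percent of the disk.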
27 | ||
28 | dm-zoned target devices are formatted and checked using the dmzadm | |
29 | utility available at: | |
30 | ||
31 | https://github.com/hgst/dm-zoned-tools | |
32 | ||
33 | Algorithm | |
34 | ========= | |
35 | ||
36 | dm-zoned implements an on-disk buffering scheme to handle non-sequential | |
37 | write accesses to the sequential zones of a zoned block device. | |
38 | Conventional zones are used for caching as well as for storing internal | |
39 | metadata. | |
40 | ||
41 | The zones of the device are separated into 2 types: | |
42 | ||
43 | 1) Metadata zones: these are conventional zones used to store metadata. | |
44 | Metadata zones are not reported as useable capacity to the user. | |
45 | ||
46 | 2) Data zones: all remaining zones, the vast majority of which will be | |
47 | sequential zones used exclusively to store user data. The conventional | |
48 | zones of the device may also be used for buffering user random writes. | |
49 | Data in these zones may be directly mapped to the conventional zone, but | |
50 | later moved to a sequential zone so that the conventional zone can be | |
51 | reused for buffering incoming random writes. | |
52 | ||
53 | dm-zoned exposes a logical device with a sector size of 4096 bytes, | |
54 | irrespective of the physical sector size of the backend zoned block | |
55 | device being used. This allows reducing the amount of metadata needed to | |
56 | manage valid blocks (blocks written). | |
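To illustrate why a larger logical block size reduces metadata, the sketch below compares the validity bitmap size for one zone with 512-byte and 4096-byte blocks. It assumes one bitmap bit per block and the 256 MB zone size quoted earlier (taken here as 256 MiB); the function name is hypothetical.

```python
ZONE_BYTES = 256 * 1024**2  # assumed: 256 MiB zones

def bitmap_bytes_per_zone(block_size):
    """Bytes of bitmap needed to track one zone, assuming 1 bit per block."""
    blocks = ZONE_BYTES // block_size
    return blocks // 8

small = bitmap_bytes_per_zone(512)    # bitmap size with 512-byte blocks
large = bitmap_bytes_per_zone(4096)   # bitmap size with 4096-byte blocks
print(small, large, small // large)
```

With these numbers, tracking a zone in 4096-byte blocks needs one eighth of the bitmap space required for 512-byte blocks.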
57 | ||
58 | The on-disk metadata format is as follows: | |
59 | ||
60 | 1) The first block of the first conventional zone found contains the | |
61 | super block which describes the on-disk amount and position of metadata | |
62 | blocks. | |
63 | ||
64 | 2) Following the super block, a set of blocks is used to describe the | |
65 | mapping of the logical device blocks. The mapping is done per chunk of | |
66 | blocks, with the chunk size equal to the zone size of the device. The | |
67 | mapping table is indexed by chunk number and each mapping entry | |
68 | indicates the zone number of the device storing the chunk of data. Each | |
69 | mapping entry may also indicate the zone number of a conventional | |
70 | zone used to buffer random modifications to the data zone. | |
71 | ||
72 | 3) A set of blocks used to store bitmaps indicating the validity of | |
73 | blocks in the data zones follows the mapping table. A valid block is | |
74 | defined as a block that was written and not discarded. For a buffered | |
75 | data chunk, a block can be valid either in the data zone mapping the | |
76 | chunk or in the buffer zone of the chunk, but never in both. | |
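The chunk mapping described in item 2 can be pictured with the following minimal sketch. The names MapEntry, locate and BLOCKS_PER_CHUNK are hypothetical, the chunk size assumes 256 MiB zones with 4096-byte blocks, and the layout is an in-memory illustration, not the actual on-disk format.

```python
from dataclasses import dataclass
from typing import Optional

BLOCKS_PER_CHUNK = 65536  # assumed: 256 MiB chunk / 4096-byte blocks

@dataclass
class MapEntry:
    data_zone: Optional[int] = None    # zone storing the chunk, None if unmapped
    buffer_zone: Optional[int] = None  # conventional zone buffering random writes

def locate(block):
    """Split a logical block number into (chunk number, block offset in chunk)."""
    return divmod(block, BLOCKS_PER_CHUNK)

# Hypothetical table: chunk 0 lives in zone 12 and is buffered by zone 3,
# chunk 1 has no mapping yet.
table = [MapEntry(data_zone=12, buffer_zone=3), MapEntry()]

chunk, offset = locate(70000)
entry = table[chunk]   # the table is indexed by chunk number
print(chunk, offset, entry)
```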
77 | ||
78 | For a logical chunk mapped to a conventional zone, all write operations | |
79 | are processed by directly writing to the zone. If the mapping zone is a | |
80 | sequential zone, the write operation is processed directly only if the | |
81 | write offset within the logical chunk is equal to the write pointer | |
82 | offset within the sequential data zone (i.e. the write operation is | |
83 | aligned on the zone write pointer). Otherwise, write operations are | |
84 | processed indirectly using a buffer zone. In that case, an unused | |
85 | conventional zone is allocated and assigned to the chunk being | |
86 | accessed. Writing a block to the buffer zone of a chunk will | |
87 | automatically invalidate the same block in the sequential zone mapping | |
88 | the chunk. If all blocks of the sequential zone become invalid, the zone | |
89 | is freed and the chunk buffer zone becomes the primary zone mapping the | |
90 | chunk, resulting in native random write performance similar to a regular | |
91 | block device. | |
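The write handling rules above can be sketched as a small model. This is an illustration under assumptions, not the kernel code: the Chunk class and its fields are hypothetical, validity is tracked with Python sets instead of bitmaps, and writes are one block at a time.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    # Valid blocks in the sequential zone; None once that zone has been freed.
    seq_valid: Optional[set] = field(default_factory=set)
    # Valid blocks in the conventional zone; None while none is allocated.
    conv_valid: Optional[set] = None
    write_pointer: int = 0  # next writable block of the sequential zone

def write_block(chunk, offset):
    """Apply a one-block write and return the path taken."""
    if chunk.seq_valid is None:
        # The conventional zone is the primary zone: write directly.
        chunk.conv_valid.add(offset)
        return "direct"
    if offset == chunk.write_pointer:
        # Aligned on the write pointer: write directly to the sequential zone.
        chunk.seq_valid.add(offset)
        chunk.write_pointer += 1
        if chunk.conv_valid is not None:
            # Assumption: drop any older buffered copy of this block.
            chunk.conv_valid.discard(offset)
        return "direct"
    if chunk.conv_valid is None:
        chunk.conv_valid = set()        # allocate an unused conventional zone
    chunk.conv_valid.add(offset)
    chunk.seq_valid.discard(offset)     # invalidate the sequential copy
    if not chunk.seq_valid:
        chunk.seq_valid = None          # zone freed, buffer becomes primary
    return "buffered"

c = Chunk()
paths = [write_block(c, off) for off in (0, 1, 0, 1, 7)]
print(paths)
```

In this run the first two writes are aligned, the two rewrites go through the buffer zone, and the last write is direct because the buffer zone has become the primary zone of the chunk.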
92 | ||
93 | Read operations are processed according to the block validity | |
94 | information provided by the bitmaps. Valid blocks are read either from | |
95 | the sequential zone mapping a chunk, or if the chunk is buffered, from | |
96 | the buffer zone assigned. If the accessed chunk has no mapping, or the | |
97 | accessed blocks are invalid, the read buffer is zeroed and the read | |
98 | operation terminated. | |
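A minimal sketch of this read logic follows. It models each zone as a dictionary holding only its valid blocks, so that key membership stands in for the validity bitmap; the function and parameter names are hypothetical.

```python
BLOCK_SIZE = 4096  # logical block size exposed by dm-zoned

def read_block(offset, seq_zone, buf_zone):
    """Read one block of a chunk.

    seq_zone and buf_zone map block offsets to data for valid blocks only,
    or are None when the chunk has no such zone.
    """
    if buf_zone is not None and offset in buf_zone:
        return buf_zone[offset]      # block is valid in the buffer zone
    if seq_zone is not None and offset in seq_zone:
        return seq_zone[offset]      # block is valid in the sequential zone
    return bytes(BLOCK_SIZE)         # unmapped or invalid: zero-filled

seq = {0: b"a" * BLOCK_SIZE}
buf = {1: b"b" * BLOCK_SIZE}
print(read_block(0, seq, buf)[:1], read_block(1, seq, buf)[:1])
```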
99 | ||
100 | After some time, the limited number of conventional zones available may | |
101 | be exhausted (all used to map chunks or buffer sequential zones) and | |
102 | unaligned writes to unbuffered chunks become impossible. To avoid this | |
103 | situation, a reclaim process regularly scans used conventional zones and | |
104 | tries to reclaim the least recently used zones by copying the valid | |
105 | blocks of the buffer zone to a free sequential zone. Once the copy | |
106 | completes, the chunk mapping is updated to point to the sequential zone | |
107 | and the buffer zone freed for reuse. | |
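One reclaim step could look like the sketch below. All names are hypothetical, and the model makes an assumption the text does not state: blocks still valid in the chunk's old sequential zone are merged into the new zone together with the buffered blocks, and the old zone is then returned to the free list.

```python
from collections import OrderedDict

def reclaim_one(buffers, mapping, free_seq_zones, zones):
    """Reclaim the least recently used buffer zone.

    buffers:        OrderedDict chunk -> buffer zone id, least recently used first
    mapping:        dict chunk -> data zone id
    free_seq_zones: list of free sequential zone ids
    zones:          dict zone id -> {block offset: data} for valid blocks
    Returns the id of the freed buffer zone, or None if nothing was reclaimed.
    """
    if not buffers or not free_seq_zones:
        return None
    chunk, buf_zone = buffers.popitem(last=False)   # least recently used entry
    new_zone = free_seq_zones.pop()
    old_zone = mapping.get(chunk)
    merged = dict(zones[old_zone]) if old_zone is not None else {}
    merged.update(zones[buf_zone])                  # buffered blocks are newest
    # A sequential zone must be written in increasing offset order.
    zones[new_zone] = {off: merged[off] for off in sorted(merged)}
    mapping[chunk] = new_zone                       # update the chunk mapping
    zones[buf_zone] = {}                            # buffer zone freed for reuse
    if old_zone is not None:
        zones[old_zone] = {}
        free_seq_zones.append(old_zone)
    return buf_zone

zones = {1: {0: b"old0"}, 2: {}, 9: {1: b"new1"}}
mapping = {0: 1}
buffers = OrderedDict([(0, 9)])
free = [2]
freed = reclaim_one(buffers, mapping, free, zones)
print(freed, mapping, zones)
```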
108 | ||
109 | Metadata Protection | |
110 | =================== | |
111 | ||
112 | To protect metadata against corruption in case of sudden power loss or | |
113 | system crash, 2 sets of metadata zones are used. One set, the primary | |
114 | set, is used as the main metadata region, while the secondary set is | |
115 | used as a staging area. Modified metadata is first written to the | |
116 | secondary set and validated by updating the super block in the secondary | |
117 | set; a generation counter is used to indicate that this set contains the | |
118 | newest metadata. Once this operation completes, in-place metadata | |
119 | block updates can be done in the primary metadata set. This ensures that | |
120 | one of the sets is always consistent (all modifications committed or none | |
121 | at all). Flush operations are used as a commit point. Upon reception of | |
122 | a flush request, metadata modification activity is temporarily blocked | |
123 | (for both incoming BIO processing and reclaim process) and all dirty | |
124 | metadata blocks are staged and updated. Normal operation is then | |
125 | resumed. Flushing metadata thus only temporarily delays write and | |
126 | discard requests. Read requests can be processed concurrently while | |
127 | metadata flush is being executed. | |
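The staging scheme can be illustrated with a toy model. The class, the crash_after parameter and the recovery rule (use the secondary set only if its generation is strictly newer) are assumptions made for this sketch; the text does not describe how the set to use is selected after a crash.

```python
class MetadataSet:
    def __init__(self):
        self.generation = 0   # generation counter kept in the super block
        self.blocks = {}      # metadata block number -> content

def flush(primary, secondary, dirty, crash_after=None):
    """Commit dirty metadata blocks; crash_after simulates a power loss."""
    secondary.blocks.update(dirty)                   # 1. stage in the secondary set
    if crash_after == "stage":
        return
    secondary.generation = primary.generation + 1    # 2. validate via super block
    if crash_after == "validate":
        return
    primary.blocks.update(dirty)                     # 3. in-place primary updates
    if crash_after == "update":
        return
    primary.generation = secondary.generation        # 4. primary is current again

def recover(primary, secondary):
    """Assumed rule: trust the secondary set only if it is strictly newer."""
    return secondary if secondary.generation > primary.generation else primary

p, s = MetadataSet(), MetadataSet()
flush(p, s, {1: "a"})                                # clean flush
flush(p, s, {1: "b"}, crash_after="validate")        # crash before primary update
good = recover(p, s)
print(good is s, good.blocks)
```

In this run the crash leaves the primary set with the old content and the secondary set with the new content, and the generation counter identifies the secondary set as the consistent, newest one.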
128 | ||
129 | Usage | |
130 | ===== | |
131 | ||
132 | A zoned block device must first be formatted using the dmzadm tool. This | |
133 | will analyze the device zone configuration, determine where to place the | |
134 | metadata sets on the device and initialize the metadata sets. | |
135 | ||
136 | Ex: | |
137 | ||
138 | dmzadm --format /dev/sdxx | |
139 | ||
140 | For a formatted device, the target can be created normally with the | |
141 | dmsetup utility. The only parameter that dm-zoned requires is the | |
142 | underlying zoned block device name. Ex: | |
143 | ||
144 | echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}` |
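For readers unfamiliar with dmsetup tables, the sketch below builds the same table line and target name as the shell command above. The fields are the start sector, the length in 512-byte sectors (as reported by blockdev --getsize), the target name and the device. The helper names and the /dev/sdb path are hypothetical, and the sketch only builds strings without calling dmsetup.

```python
import os

def zoned_table(dev, size_sectors):
    """Build the table line: start, length in 512-byte sectors, target, device."""
    return f"0 {size_sectors} zoned {dev}"

def target_name(dev):
    """Name the mapped device after the backend device, e.g. dmz-sdb."""
    return "dmz-" + os.path.basename(dev)

# Hypothetical device and size, for illustration only.
print(target_name("/dev/sdb"), "<-", zoned_table("/dev/sdb", 1000))
```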