Introduction
============

This document describes a collection of device-mapper targets that
between them implement thin-provisioning and snapshots.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume. This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.

Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...). The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth). This new
implementation uses a single data structure to avoid this degradation
with depth. Fragmentation may still be an issue, however, in some
scenarios.

Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:

- Improve metadata resilience by storing metadata on a mirrored volume
  but data on a non-mirrored one.

- Improve performance by storing the metadata on SSD.

Status
======

These targets are considered safe for production use. But different use
cases will have different performance characteristics, for example due
to fragmentation of the data volume.

If you find this software is not performing as expected, please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.

Userspace tools for checking and repairing the metadata have been fully
developed and are available as 'thin_check' and 'thin_repair'. The name
of the package that provides these utilities varies by distribution (on
a Red Hat distribution it is named 'device-mapper-persistent-data').

Cookbook
========

This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly. End users will be advised to use a higher-level volume
manager such as LVM2 once support has been added.

Pool device
-----------

The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace which control the creation of new
  virtual devices amongst other things.

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid metadata device, and a
data device. If you do not have an existing metadata device you can
make one by zeroing the first 4k to indicate empty metadata.

    dd if=/dev/zero of=$metadata_dev bs=4096 count=1

The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots). If you have
less sharing than average you'll need a larger-than-average metadata device.

As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size but round it up
to 2MB if the answer is smaller. If you're creating large numbers of
snapshots which are recording large amounts of change, you may find you
need to increase this.

The largest size supported is 16GB: If the device is larger,
a warning will be issued and the excess space will not be used.
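
As a rough worked example (the device, its size and the block size below
are assumptions for illustration, not requirements), a 100GiB data device
split into 64KiB blocks needs about 48 * (100GiB / 64KiB) bytes, roughly
75MB, comfortably above the 2MB floor:

    # Sizing sketch: the values below are illustrative assumptions.
    data_dev_size=$(blockdev --getsize64 $data_dev)     # data device size in bytes
    block_bytes=65536                                    # 64KiB block size (128 sectors)
    meta_bytes=$(( 48 * data_dev_size / block_bytes ))   # suggested metadata size in bytes
    [ "$meta_bytes" -lt 2097152 ] && meta_bytes=2097152  # round up to the 2MB minimum
    echo "$meta_bytes"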

Reloading a pool table
----------------------

You may reload a pool's table, indeed this is how the pool is resized
if it runs out of space. (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)
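
For example, after growing the underlying data device, the pool can be
resized with a reload along these lines (the new length of 41943040
sectors is illustrative):

    dmsetup suspend pool
    dmsetup reload pool --table "0 41943040 thin-pool $metadata_dev $data_dev \
        $data_block_size $low_water_mark"
    dmsetup resume pool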

Using an existing pool device
-----------------------------

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev \
                 $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time, expressed in units of 512-byte sectors.
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
multiple of 128 (64KB). $data_block_size cannot be changed after the
thin-pool is created. People primarily interested in thin provisioning
may want to use a value such as 1024 (512KB). People doing lots of
snapshotting may want a smaller value such as 128 (64KB). If you are
not zeroing newly-allocated data, a larger $data_block_size in the
region of 256000 (128MB) is suggested.

$low_water_mark is expressed in blocks of size $data_block_size. If
free space on the data device drops below this level then a dm event
will be triggered which a userspace daemon should catch allowing it to
extend the pool device. Only one such event will be sent.

No special event is triggered if a just resumed device's free space is below
the low water mark. However, resuming a device always triggers an
event; a userspace daemon should verify that free space exceeds the low
water mark when handling this event.
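
A minimal sketch of such a daemon, using only dmsetup (real deployments
typically rely on dmeventd or LVM instead, and the pool name 'pool' is an
assumption), might look like:

    # Naive watcher sketch: block on dm events from the pool, then check status.
    while true; do
        ev=$(dmsetup info -c --noheadings -o events pool)  # current event counter
        dmsetup wait pool "$ev"                            # sleep until the counter increases
        dmsetup status pool                                # parse <used data>/<total data> blocks
        # ...grow the data device and reload the pool table if free space is low...
    done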

A low water mark for the metadata device is maintained in the kernel and
will trigger a dm event if free space on the metadata device drops below
it.

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second. This
means the thin-provisioning target behaves like a physical disk that has
a volatile write cache. If power is lost you may lose some recent
writes. The metadata should always be consistent in spite of any crash.

If data space is exhausted the pool will either error or queue IO
according to the configuration (see: error_if_no_space). If metadata
space is exhausted or a metadata operation fails: the pool will error IO
until the pool is taken offline and repair is performed to 1) fix any
potential inconsistencies and 2) clear the flag that imposes repair.
Once the pool's metadata device is repaired it may be resized, which
will allow the pool to return to normal operation. Note that if a pool
is flagged as needing repair, the pool's data and metadata devices
cannot be resized until repair is performed. It should also be noted
that when the pool's metadata space is exhausted the current metadata
transaction is aborted. Given that the pool will cache IO whose
completion may have already been acknowledged to upper IO layers
(e.g. filesystem) it is strongly suggested that consistency checks
(e.g. fsck) be performed on those layers when repair of the pool is
required.
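
A sketch of that offline repair cycle using the userspace tools mentioned
above ($repaired_metadata_dev is an assumed spare device used to hold the
rebuilt metadata):

    dmsetup remove thin                      # deactivate all thin devices first
    dmsetup remove pool                      # then take the pool offline
    thin_check $metadata_dev                 # report metadata inconsistencies
    thin_repair -i $metadata_dev -o $repaired_metadata_dev
    # Reload the pool table pointing at the repaired metadata device, then
    # resize the metadata device if it needs to grow.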

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

To create a new thinly-provisioned volume you must send a message to an
active pool device, /dev/mapper/pool in this example.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

Here '0' is an identifier for the volume, a 24-bit number. It's up
to the caller to allocate and manage these identifiers. If the
identifier is already in use, the message will fail with -EEXIST.

ii) Using a thinly-provisioned volume.

Thinly-provisioned volumes are activated using the 'thin' target:

    dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"

The last parameter is the identifier for the thinp device.
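
From here the device behaves like any other block device; for example
(illustrative, assuming you want an ext4 filesystem mounted at /mnt):

    mkfs.ext4 /dev/mapper/thin
    mount /dev/mapper/thin /mnt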

Internal snapshots
------------------

i) Creating an internal snapshot.

Snapshots are created with another message to the pool.

N.B. If the origin device that you wish to snapshot is active, you
must suspend it before creating the snapshot to avoid corruption.
This is NOT enforced at the moment, so please be careful!

    dmsetup suspend /dev/mapper/thin
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/thin

Here '1' is the identifier for the volume, a 24-bit number. '0' is the
identifier for the origin device.

ii) Using an internal snapshot.

Once created, the user doesn't have to worry about any connection
between the origin and the snapshot. Indeed the snapshot is no
different from any other thinly-provisioned device and can be
snapshotted itself via the same method. It's perfectly legal to
have only one of them active, and there's no ordering requirement on
activating or removing them both. (This differs from conventional
device-mapper snapshots.)

Activate it exactly the same way as any other thinly-provisioned volume:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"

External snapshots
------------------

You can use an external _read only_ device as an origin for a
thinly-provisioned volume. Any read to an unprovisioned area of the
thin device will be passed through to the origin. Writes trigger
the allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

You must not write to the origin device if you use this technique!
Of course, you may write to the thin device and take internal snapshots
of the thin volume.
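
As a precaution, you may wish to mark the base image read-only at the
block layer so that stray writes to the origin fail (illustrative, using
the /dev/image device from the example below):

    blockdev --setro /dev/image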

i) Creating a snapshot of an external device

This is the same as creating a thin device.
You don't mention the origin at this stage.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

ii) Using a snapshot of an external device.

Append an extra parameter to the thin target specifying the origin:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"

N.B. All descendants (internal snapshots) of this snapshot require the
same extra origin parameter.

Deactivation
------------

All devices using a pool must be deactivated before the pool itself
can be.

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

    thin-pool <metadata dev> <data dev> <data block size (sectors)> \
              <low water mark (blocks)> [<number of feature args> [<arg>]*]

    Optional feature arguments:

      skip_block_zeroing: Skip the zeroing of newly-provisioned blocks.

      ignore_discard: Disable discard support.

      no_discard_passdown: Don't pass discards down to the underlying
                           data device, but just remove the mapping.

      read_only: Don't allow any changes to be made to the pool
                 metadata. This mode is only available after the
                 thin-pool has been created and first used in full
                 read/write mode. It cannot be specified on initial
                 thin-pool creation.

      error_if_no_space: Error IOs, instead of queueing, if no space.

    Data block size must be between 64KB (128 sectors) and 1GB
    (2097152 sectors) inclusive.
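
    For example, a pool created without block zeroing and without discard
    passdown might be set up like this (the sizes, low water mark and
    device names are illustrative):

        dmsetup create pool \
            --table "0 20971520 thin-pool $metadata_dev $data_dev \
                     128 32768 2 skip_block_zeroing no_discard_passdown"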

ii) Status

    <transaction id> <used metadata blocks>/<total metadata blocks>
    <used data blocks>/<total data blocks> <held metadata root>
    ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
    needs_check|- metadata_low_watermark

    transaction id:
        A 64-bit number used by userspace to help synchronise with metadata
        from volume managers.

    used data blocks / total data blocks
        If the number of free blocks drops below the pool's low water mark a
        dm event will be sent to userspace. This event is edge-triggered and
        it will occur only once after each resume so volume manager writers
        should register for the event and then check the target's status.

    held metadata root:
        The location, in blocks, of the metadata root that has been
        'held' for userspace read access. '-' indicates there is no
        held root.

    discard_passdown|no_discard_passdown
        Whether or not discards are actually being passed down to the
        underlying device. Even if this is enabled when loading the table,
        it can get disabled if the underlying device doesn't support it.

    ro|rw|out_of_data_space
        If the pool encounters certain types of device failures it will
        drop into a read-only metadata mode in which no changes to
        the pool metadata (like allocating new blocks) are permitted.

        In serious cases where even a read-only mode is deemed unsafe
        no further I/O will be permitted and the status will just
        contain the string 'Fail'. The userspace recovery tools
        should then be used.

    error_if_no_space|queue_if_no_space
        If the pool runs out of data or metadata space, the pool will
        either queue or error the IO destined to the data device. The
        default is to queue the IO until more space is added or the
        'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool
        module parameter can be used to change this timeout -- it
        defaults to 60 seconds but may be disabled using a value of 0.

    needs_check
        A metadata operation has failed, resulting in the needs_check
        flag being set in the metadata's superblock. The metadata
        device must be deactivated and checked/repaired before the
        thin-pool can be made fully operational again. '-' indicates
        needs_check is not set.

    metadata_low_watermark:
        Value of metadata low watermark in blocks. The kernel sets this
        value internally but userspace needs to know this value to
        determine if an event was caused by crossing this threshold.
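
    Putting the fields together, an illustrative (entirely made-up) status
    line for a healthy pool might look like:

        # dmsetup status pool
        0 20971520 thin-pool 1 117/4096 1500/163840 - rw discard_passdown queue_if_no_space - 1024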

iii) Messages

    create_thin <dev id>

        Create a new thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.

    create_snap <dev id> <origin id>

        Create a new snapshot of another thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.
        <origin id> is the identifier of the thinly-provisioned device
        of which the new device will be a snapshot.

    delete <dev id>

        Deletes a thin device. Irreversible.
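
        For example, to delete the snapshot created in the cookbook above
        (device id 1), after deactivating any 'thin' target that uses it:

            dmsetup message /dev/mapper/pool 0 "delete 1"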

    set_transaction_id <current id> <new id>

        Userland volume managers, such as LVM, need a way to
        synchronise their external metadata with the internal metadata of the
        pool target. The thin-pool target offers to store an
        arbitrary 64-bit transaction id and return it on the target's
        status line. To avoid races you must provide what you think
        the current transaction id is when you change it with this
        compare-and-swap message.
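
        For example, moving the stored transaction id from 0 to 1 (the ids
        are illustrative; the message fails if the current id does not
        match):

            dmsetup message /dev/mapper/pool 0 "set_transaction_id 0 1"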

    reserve_metadata_snap

        Reserve a copy of the data mapping btree for use by userland.
        This allows userland to inspect the mappings as they were when
        this message was executed. Use the pool's status command to
        get the root block associated with the metadata snapshot.

    release_metadata_snap

        Release a previously reserved copy of the data mapping btree.
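
        A typical round trip, assuming the thin_dump tool from the same
        userspace package as thin_check (the output path is illustrative):

            dmsetup message /dev/mapper/pool 0 reserve_metadata_snap
            dmsetup status pool     # the held metadata root appears here
            thin_dump -m $metadata_dev > /tmp/pool-mappings.xml
            dmsetup message /dev/mapper/pool 0 release_metadata_snap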

'thin' target
-------------

i) Constructor

    thin <pool dev> <dev id> [<external origin dev>]

    pool dev:
        the thin-pool device, e.g. /dev/mapper/my_pool or 253:0

    dev id:
        the internal device identifier of the device to be
        activated.

    external origin dev:
        an optional block device outside the pool to be treated as a
        read-only snapshot origin: reads to unprovisioned areas of the
        thin target will be mapped to this device.

The pool doesn't store any size against the thin devices. If you
load a thin target that is smaller than you've been using previously,
then you'll have no access to blocks mapped beyond the end. If you
load a target that is bigger than before, then extra blocks will be
provisioned as and when needed.
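
For example, to grow the thin device from the cookbook from 1GB to 2GB of
virtual size (the sector counts are illustrative), reload it with a larger
length:

    dmsetup suspend thin
    dmsetup reload thin --table "0 4194304 thin /dev/mapper/pool 0"
    dmsetup resume thin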

ii) Status

    <nr mapped sectors> <highest mapped sector>

    If the pool has encountered device errors and failed, the status
    will just contain the string 'Fail'. The userspace recovery
    tools should then be used.

    In the case where <nr mapped sectors> is 0, there is no highest
    mapped sector and the value of <highest mapped sector> is unspecified.