Introduction
============

dm-cache is a device mapper target written by Joe Thornber, Heinz
Mauelshagen, and Mike Snitzer.

It aims to improve performance of a block device (e.g. a spindle) by
dynamically migrating some of its data to a faster, smaller device
(e.g. an SSD).

This device-mapper solution allows us to insert this caching at
different levels of the dm stack, for instance above the data device for
a thin-provisioning pool.  Caching solutions that are integrated more
closely with the virtual memory system should give better performance.

The target reuses the metadata library used by the thin-provisioning
target.

The decision as to what data to migrate and when is left to a plug-in
policy module.  Several of these have been written as we experiment,
and we hope other people will contribute others for specific I/O
scenarios (e.g. a VM image server).

Glossary
========

  Migration -  Movement of the primary copy of a logical block from one
               device to the other.
  Promotion -  Migration from slow device to fast device.
  Demotion  -  Migration from fast device to slow device.

The origin device always contains a copy of the logical block, which
may be out of date or kept in sync with the copy on the cache device
(depending on policy).

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with
other parameters detailed later):

 1. An origin device - the big, slow one.

 2. A cache device - the small, fast one.

 3. A small metadata device - records which blocks are in the cache,
    which are dirty, and extra hints for use by the policy object.
    This information could be put on the cache device, but having it
    separate allows the volume manager to configure it differently,
    e.g. as a mirror for extra robustness.

Fixed block size
----------------

The origin is divided up into blocks of a fixed size.  This block size
is configurable when you first create the cache.  Typically we've been
using block sizes of 256KB - 1024KB.

Having a fixed block size simplifies the target a lot.  But it is
something of a compromise.  For instance, a small part of a block may be
getting hit a lot, yet the whole block will be promoted to the cache.
So large block sizes are bad because they waste cache space.  And small
block sizes are bad because they increase the amount of metadata (both
in core and on disk).

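The trade-off above can be made concrete with a little arithmetic.  The
following is only a sketch: the helper names and the 64GiB cache device
are made up for illustration, and the one firm assumption is that
device-mapper tables count sizes in 512-byte sectors.

```python
SECTOR_SIZE = 512  # bytes; device-mapper tables count in 512-byte sectors
KIB = 1024
GIB = 1024 ** 3


def block_size_in_sectors(block_size_bytes):
    """Convert a cache block size in bytes to the sector count used in the table."""
    if block_size_bytes % SECTOR_SIZE:
        raise ValueError("block size must be a whole number of sectors")
    return block_size_bytes // SECTOR_SIZE


def cache_block_count(cache_dev_bytes, block_size_bytes):
    """Number of whole cache blocks that fit on the cache device."""
    return cache_dev_bytes // block_size_bytes


# Hypothetical 64GiB cache device, chosen only for illustration.
for size in (256 * KIB, 1024 * KIB):
    print(block_size_in_sectors(size), cache_block_count(64 * GIB, size))
```

Each cache block needs its own metadata entry, so the smaller block
size here means four times as many entries to track.
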
Writeback/writethrough
----------------------

The cache has two modes, writeback and writethrough.

If writeback, the default, is selected then a write to a block that is
cached will go only to the cache and the block will be marked dirty in
the metadata.

If writethrough is selected then a write to a cached block will not
complete until it has hit both the origin and cache devices.  Clean
blocks should remain clean.

A simple cleaner policy is provided, which will clean (write back) all
dirty blocks in a cache.  This is useful for decommissioning a cache.

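As a rough illustration of the two modes, here is a toy model of writes
to blocks that are already cached.  It is not how the kernel target is
implemented; the class and its methods are invented purely to show when
a block becomes dirty and what cleaning does.

```python
class ToyCache:
    """Toy model of write handling; names here are hypothetical."""

    def __init__(self, writethrough=False):
        self.writethrough = writethrough
        self.origin = {}    # block number -> data on the slow device
        self.cache = {}     # block number -> data on the fast device
        self.dirty = set()  # cached blocks that differ from the origin

    def write(self, block, data):
        """Write to a block that is assumed to be cached already."""
        self.cache[block] = data
        if self.writethrough:
            # The write completes only once both copies are updated,
            # so the block stays clean.
            self.origin[block] = data
        else:
            # Writeback: only the cache is updated; the origin is stale.
            self.dirty.add(block)

    def clean(self):
        """Write back every dirty block, as a cleaner policy would."""
        for block in sorted(self.dirty):
            self.origin[block] = self.cache[block]
        self.dirty.clear()
```

In writeback mode the origin only catches up once clean() has run, which
is why all dirty blocks must be written back before a cache is removed.
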
Migration throttling
--------------------

Migrating data between the origin and cache device uses bandwidth.
The user can set a throttle to prevent more than a certain amount of
migration occurring at any one time.  Currently we're not taking any
account of normal I/O traffic going to the devices.  More work needs
doing here to avoid migrating during peak I/O moments.

For the time being, a message "migration_threshold <#sectors>"
can be used to set the maximum number of sectors being migrated,
the default being 204800 sectors (or 100MB).

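The threshold is given in 512-byte sectors, which is easy to get wrong
by a factor of two.  A small sketch of the conversion follows; the
helper names are made up, and only the message format
"migration_threshold <#sectors>" comes from the text above.

```python
SECTOR_SIZE = 512  # bytes per sector


def sectors_to_mib(sectors):
    """Size in MiB of a threshold expressed in sectors."""
    return sectors * SECTOR_SIZE / (1024 * 1024)


def mib_to_sectors(mib):
    """Sector count for a threshold expressed in MiB."""
    return mib * 1024 * 1024 // SECTOR_SIZE


def migration_threshold_message(sectors):
    """Build the message body that sets the migration threshold."""
    return "migration_threshold %d" % sectors


print(sectors_to_mib(204800))                           # the default threshold
print(migration_threshold_message(mib_to_sectors(50)))  # halve it
```

The resulting string is what would follow the device name and sector
argument of a dmsetup message command.
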
Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is
written.  If no such requests are made then commits will occur every
second.  This means the cache behaves like a physical disk that has a
write cache (the same is true of the thin-provisioning target).  If
power is lost you may lose some recent writes.  The metadata should
always be consistent in spite of any crash.

The 'dirty' state for a cache block changes far too frequently for us
to keep updating it on the fly.  So we treat it as a hint.  In normal
operation it will be written when the dm device is suspended.  If the
system crashes all cache blocks will be assumed dirty when restarted.

Per-block policy hints
----------------------

Policy plug-ins can store a chunk of data per cache block.  It's up to
the policy how big this chunk is, but it should be kept small.  Like the
dirty flags this data is lost if there's a crash so a safe fallback
value should always be possible.

For instance, the 'mq' policy, which is currently the default policy,
uses this facility to store the hit count of the cache blocks.  If
there's a crash this information will be lost, which means the cache
may be less efficient until those hit counts are regenerated.

Policy hints affect performance, not correctness.

Policy messaging
----------------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these.  Device-mapper
messages are used.  Refer to cache-policies.txt.

Discard bitset resolution
-------------------------

We can avoid copying data during migration if we know the block has
been discarded.  A prime example of this is when mkfs discards the
whole block device.  We store a bitset tracking the discard state of
blocks.  However, we allow this bitset to have a different block size
from the cache blocks.  This is because we need to track the discard
state for all of the origin device (compare with the dirty bitset
which is just for the smaller cache device).

Target interface
================

Constructor
-----------

 cache <metadata dev> <cache dev> <origin dev> <block size>
       <#feature args> [<feature arg>]*
       <policy> <#policy args> [policy args]*

 metadata dev    : fast device holding the persistent metadata
 cache dev       : fast device holding cached data blocks
 origin dev      : slow device holding original data blocks
 block size      : cache unit size in sectors

 #feature args   : number of feature arguments passed
 feature args    : writethrough.  (The default is writeback.)

 policy          : the replacement policy to use
 #policy args    : an even number of arguments corresponding to
                   key/value pairs passed to the policy
 policy args     : key/value pairs passed to the policy
                   E.g. 'sequential_threshold 1024'
                   See cache-policies.txt for details.

Optional feature arguments are:
   writethrough  : write through caching that prohibits cache block
                   content from being different from origin block content.
                   Without this argument, the default behaviour is to write
                   back cache block contents later for performance reasons,
                   so they may differ from the corresponding origin blocks.

A policy called 'default' is always registered.  This is an alias for
the policy we currently think is giving best all round performance.

As the default policy could vary between kernels, if you are relying on
the characteristics of a specific policy, always request it by name.

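The counted argument groups are the part of the constructor line that is
easiest to get wrong by hand.  The sketch below assembles a table line
in the field order given above.  The function name and its parameters
are hypothetical; the leading start and length fields are the usual
device-mapper table prefix, as seen in the dmsetup examples in this
document.

```python
def cache_table(start, length, metadata_dev, cache_dev, origin_dev,
                block_size, features=(), policy="default", policy_args=None):
    """Assemble a cache target table line (all sizes in sectors)."""
    policy_args = policy_args or {}
    fields = [start, length, "cache",
              metadata_dev, cache_dev, origin_dev, block_size,
              len(features), *features,      # <#feature args> [<feature arg>]*
              policy, 2 * len(policy_args)]  # count is keys plus values
    for key, value in policy_args.items():
        fields += [key, value]
    return " ".join(str(field) for field in fields)


print(cache_table(0, 41943040, "/dev/mapper/metadata", "/dev/mapper/ssd",
                  "/dev/mapper/origin", 1024, ["writeback"], "mq",
                  {"sequential_threshold": 1024, "random_threshold": 8}))
```

Deriving both counts from the arguments themselves keeps
<#feature args> and <#policy args> in step with what follows them.
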
Status
------

<#used metadata blocks>/<#total metadata blocks> <#read hits> <#read misses>
<#write hits> <#write misses> <#demotions> <#promotions> <#blocks in cache>
<#dirty> <#feature args> <feature args>* <#core args> <core args>*
<#policy args> <policy args>*

#used metadata blocks    : Number of metadata blocks used
#total metadata blocks   : Total number of metadata blocks
#read hits               : Number of times a READ bio has been mapped
                           to the cache
#read misses             : Number of times a READ bio has been mapped
                           to the origin
#write hits              : Number of times a WRITE bio has been mapped
                           to the cache
#write misses            : Number of times a WRITE bio has been
                           mapped to the origin
#demotions               : Number of times a block has been removed
                           from the cache
#promotions              : Number of times a block has been moved to
                           the cache
#blocks in cache         : Number of blocks resident in the cache
#dirty                   : Number of blocks in the cache that differ
                           from the origin
#feature args            : Number of feature args to follow
feature args             : 'writethrough' (optional)
#core args               : Number of core arguments (must be even)
core args                : Key/value pairs for tuning the core
                           e.g. migration_threshold
#policy args             : Number of policy arguments to follow (must be even)
policy args              : Key/value pairs
                           e.g. 'sequential_threshold 1024'

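A status line in this layout can be split mechanically, because each
variable-length group is preceded by its count.  The parser below is a
sketch only: the function name and dictionary keys are made up, the
sample line is invented for illustration rather than taken from a real
device, and the input is assumed to be just the fields listed above.

```python
def parse_cache_status(params):
    """Split the cache target's status fields into a dictionary."""
    tokens = params.split()
    used, total = tokens[0].split("/")
    status = {"used_metadata_blocks": int(used),
              "total_metadata_blocks": int(total)}
    counters = ["read_hits", "read_misses", "write_hits", "write_misses",
                "demotions", "promotions", "blocks_in_cache", "dirty"]
    for name, token in zip(counters, tokens[1:9]):
        status[name] = int(token)

    def take_counted(pos):
        """Read a count at tokens[pos] and return that many following tokens."""
        count = int(tokens[pos])
        return tokens[pos + 1:pos + 1 + count], pos + 1 + count

    status["features"], pos = take_counted(9)
    core, pos = take_counted(pos)
    policy, pos = take_counted(pos)
    # Core and policy arguments are key/value pairs, hence the even counts.
    status["core_args"] = dict(zip(core[0::2], core[1::2]))
    status["policy_args"] = dict(zip(policy[0::2], policy[1::2]))
    return status


# Invented sample line, for illustration only.
sample = ("23/4096 100 20 50 10 3 7 1024 12 1 writethrough "
          "2 migration_threshold 204800 "
          "4 sequential_threshold 1024 random_threshold 8")
print(parse_cache_status(sample))
```

From the parsed counters, a read hit ratio is simply
read_hits / (read_hits + read_misses).
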
Messages
--------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these.  Device-mapper
messages are used.  (A sysfs interface would also be possible.)

The message format is:

   <key> <value>

E.g.
   dmsetup message my_cache 0 sequential_threshold 1024

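When tunables are set from a script, it can help to build the command
as an argument list rather than a shell string.  This is a minimal
sketch with a made-up helper name; it only constructs the arguments and
does not run anything.

```python
def policy_message_argv(device, key, value, sector=0):
    """Argument list for sending a <key> <value> message to a dm device."""
    return ["dmsetup", "message", device, str(sector), key, str(value)]


print(" ".join(policy_message_argv("my_cache", "sequential_threshold", 1024)))
```

The list could be handed to subprocess.run() by a caller with
sufficient privileges.
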
Examples
========

The test suite can be found here:

https://github.com/jthornber/thinp-test-suite

dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
	/dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'

dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
	/dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
	mq 4 sequential_threshold 1024 random_threshold 8'