Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | ============ |
2 | RBD Layering | |
3 | ============ | |
4 | ||
5 | RBD layering refers to the creation of copy-on-write clones of block | |
6 | devices. This allows for fast image creation, for example to clone a | |
7 | golden master image of a virtual machine into a new instance. To | |
8 | simplify the semantics, you can only create a clone of a snapshot - | |
9 | snapshots are always read-only, so the rest of the image is | |
10 | unaffected, and there's no possibility of writing to them | |
11 | accidentally. | |
12 | ||
13 | From a user's perspective, a clone is just like any other rbd image. | |
14 | You can take snapshots of it, read/write it, resize it, etc. | |
15 | There are no restrictions on clones from a user's viewpoint. | |
16 | ||
17 | Note: the terms `child` and `parent` below mean an rbd image created | |
18 | by cloning, and the rbd image snapshot a child was cloned from. | |
19 | ||
20 | Command line interface | |
21 | ---------------------- | |
22 | ||
23 | Before cloning a snapshot, you must mark it as protected, to prevent | |
24 | it from being deleted while child images refer to it: | |
25 | :: | |
26 | ||
27 | $ rbd snap protect pool/image@snap | |
28 | ||
29 | Then you can perform the clone: | |
30 | :: | |
31 | ||
32 | $ rbd clone [--parent] pool/parent@snap [--image] pool2/child1 | |
33 | ||
34 | You can create a clone with different object sizes from the parent: | |
35 | :: | |
36 | ||
37 | $ rbd clone --order 25 pool/parent@snap pool2/child2 | |
38 | ||
39 | To delete the parent, you must first mark it unprotected, which checks | |
40 | that there are no children left: | |
41 | :: | |
42 | ||
43 | $ rbd snap unprotect pool/image@snap | |
44 | Cannot unprotect: Still in use by pool2/image2 | |
45 | $ rbd children pool/image@snap | |
46 | pool2/child1 | |
47 | pool2/child2 | |
48 | $ rbd flatten pool2/child1 | |
49 | $ rbd rm pool2/child2 | |
50 | $ rbd snap rm pool/image@snap | |
51 | Cannot remove a protected snapshot: pool/image@snap | |
52 | $ rbd snap unprotect pool/image@snap | |
53 | ||
54 | Then the snapshot can be deleted like normal: | |
55 | :: | |
56 | ||
57 | $ rbd snap rm pool/image@snap | |
58 | ||
59 | Implementation | |
60 | -------------- | |
61 | ||
62 | Data Flow | |
63 | ^^^^^^^^^ | |
64 | ||
65 | In the initial implementation, called 'trivial layering', there will | |
66 | be no tracking of which objects exist in a clone. A read that hits a | |
67 | non-existent object will attempt to read from the parent snapshot, and | |
68 | this will continue recursively until an object exists or an image with | |
69 | no parent is found. This is done through the normal read path from | |
70 | the parent, so differing object sizes between parents and children | |
71 | do not matter. | |
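The read fall-through described above can be sketched as follows. This is a minimal in-memory illustration, not librbd code: the `Image` class and `read_object` function are hypothetical names, every layer is assumed to use the same object size, and the parent overlap (covered under Resizing) is ignored.

```python
from typing import Dict, Optional


class Image:
    """Toy image (hypothetical): object number -> bytes, plus an optional parent."""

    def __init__(self, objects: Dict[int, bytes],
                 parent: Optional["Image"] = None) -> None:
        self.objects = objects
        self.parent = parent


def read_object(image: Optional[Image], objno: int, size: int) -> bytes:
    # Walk up the chain of parents until some layer holds the object.
    while image is not None:
        data = image.objects.get(objno)
        if data is not None:
            return data
        image = image.parent
    # An image with no parent was reached and the object is still missing,
    # so the range reads as zeroes, as in any sparse image.
    return b"\0" * size


base = Image({0: b"base"})
middle = Image({}, parent=base)
child = Image({1: b"kid!"}, parent=middle)
print(read_object(child, 0, 4))  # found two levels up, in `base`
```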
72 | ||
73 | Before a write to an object is performed, the object is checked for | |
74 | existence. If it doesn't exist, a copy-up operation is performed, | |
75 | which means reading the relevant range of data from the parent | |
76 | snapshot and writing it (plus the original write) to the child | |
77 | image. To prevent races with multiple writes trying to copy-up the | |
78 | same object, this copy-up operation will include an atomic create. If | |
79 | the atomic create fails, the original write is done instead. This | |
80 | copy-up operation is implemented as a class method so that extra | |
81 | metadata can be stored by it in the future. In trivial layering, the | |
82 | copy-up operation copies the entire range needed to the child object | |
83 | (that is, the full size of the child object). A future optimization | |
84 | could make this copy-up more fine-grained. | |
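A rough sketch of the copy-up step, under simplifying assumptions: images are plain dictionaries, parent and child share one object size, and `dict.setdefault` stands in for the atomic create. The function name `copy_up_write` is made up for this illustration and is not the class method mentioned above.

```python
from typing import Dict


def copy_up_write(child: Dict[int, bytearray], parent: Dict[int, bytes],
                  objno: int, offset: int, data: bytes,
                  object_size: int) -> bool:
    """Write `data` at `offset` in object `objno`.

    Returns True if this call performed the copy-up.
    Assumes offset + len(data) <= object_size.
    """
    did_copy_up = False
    if objno not in child:
        # Copy-up: take the parent's data, zero-padded to the full size
        # of the child object (trivial layering copies the whole range).
        base = bytearray(parent.get(objno, b"").ljust(object_size, b"\0"))
        # Stand-in for the atomic create: only the first creator's copy
        # is kept. With real concurrency another writer could win here,
        # in which case only the original write below is applied.
        created = child.setdefault(objno, base)
        did_copy_up = created is base
    # The original write, applied on top of whatever the object now holds.
    child[objno][offset:offset + len(data)] = data
    return did_copy_up


child_objects: Dict[int, bytearray] = {}
parent_objects = {0: b"abcdefgh"}
copy_up_write(child_objects, parent_objects, 0, 2, b"XY", 8)
print(bytes(child_objects[0]))
```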
85 | ||
86 | Another future optimization could be storing a bitmap of which objects | |
87 | actually exist in a child. This would obviate the check for existence | |
88 | before each write, and let reads go directly to the parent if needed. | |
89 | ||
90 | These optimizations are discussed in: | |
91 | ||
92 | http://marc.info/?l=ceph-devel&m=129867273303846 | |
93 | ||
94 | Parent/Child relationships | |
95 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
96 | ||
97 | Children store a reference to their parent in their header, as a tuple | |
98 | of (pool id, image id, snapshot id). This is enough information to | |
99 | open the parent and read from it. | |
100 | ||
101 | In addition to knowing which parent a given image has, we want to be | |
102 | able to tell if a protected snapshot still has children. This is | |
103 | accomplished with a new per-pool object, `rbd_children`, which maps | |
104 | (parent pool id, parent image id, parent snapshot id) to a list of | |
105 | child image ids. This is stored in the same pool as the child image | |
106 | because the client creating a clone already has read/write access to | |
107 | everything in this pool, but may not have write access to the parent's | |
108 | pool. This lets a client with read-only access to one pool clone a | |
109 | snapshot from that pool into a pool they have full access to. It | |
110 | increases the cost of unprotecting a snapshot, since this needs to check | |
111 | for children in every pool, but this is a rare operation. It would | |
112 | likely only be done before removing old images, which is already much | |
113 | more expensive because it involves deleting every data object in the | |
114 | image. | |
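To illustrate why the child lookup has to visit every pool, here is a small sketch of the mapping. The `Pool` class and `find_children` helper are hypothetical; the dictionary only mimics what the `rbd_children` object is described as storing.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (parent pool id, parent image id, parent snapshot id)
ParentSpec = Tuple[int, str, int]


class Pool:
    """Toy pool (hypothetical) holding a stand-in for its rbd_children object."""

    def __init__(self) -> None:
        self.rbd_children: Dict[ParentSpec, Set[str]] = defaultdict(set)


def add_child(child_pool: Pool, parent: ParentSpec, child_image_id: str) -> None:
    # Recorded in the *child's* pool, which the cloning client can write to.
    child_pool.rbd_children[parent].add(child_image_id)


def find_children(pools: Dict[int, Pool],
                  parent: ParentSpec) -> List[Tuple[int, str]]:
    # Children may live in any pool, so every pool has to be consulted.
    found = []
    for pool_id, pool in sorted(pools.items()):
        for child_id in sorted(pool.rbd_children.get(parent, ())):
            found.append((pool_id, child_id))
    return found


pools = {1: Pool(), 2: Pool()}
snap = (1, "parent-image-id", 4)
add_child(pools[2], snap, "child1-id")
add_child(pools[2], snap, "child2-id")
print(find_children(pools, snap))
```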
115 | ||
116 | Protection | |
117 | ^^^^^^^^^^ | |
118 | ||
119 | Internally, protection_state is a field in the header object that | |
120 | can be in one of three states: "protected", "unprotected", and | |
121 | "unprotecting". The first two are set as the result of "rbd snap | |
122 | protect/unprotect". The "unprotecting" state is set while the "rbd snap | |
123 | unprotect" command checks for any child images. Only snapshots in the | |
124 | "protected" state may be cloned, so the "unprotecting" state prevents | |
125 | a race like: | |
126 | ||
127 | 1. A: walk through all pools, look for clones, find none | |
128 | 2. B: create a clone | |
129 | 3. A: unprotect parent | |
130 | 4. A: rbd snap rm pool/parent@snap | |
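The state transitions can be sketched as a small state machine. The names below are illustrative only, and one detail is an assumption rather than something stated above: when children are found, this sketch puts the snapshot back into the "protected" state.

```python
from typing import Callable


class Snapshot:
    def __init__(self) -> None:
        self.protection_state = "unprotected"


def protect(snap: Snapshot) -> None:
    snap.protection_state = "protected"


def can_clone(snap: Snapshot) -> bool:
    # Cloning is refused in both "unprotected" and "unprotecting".
    return snap.protection_state == "protected"


def unprotect(snap: Snapshot, has_children: Callable[[], bool]) -> bool:
    if snap.protection_state != "protected":
        return False
    # Entering "unprotecting" first means no new clone can appear
    # while the (slow) scan of all pools is in progress.
    snap.protection_state = "unprotecting"
    if has_children():
        # Assumption of this sketch: fall back to "protected".
        snap.protection_state = "protected"
        return False
    snap.protection_state = "unprotected"
    return True


snap = Snapshot()
protect(snap)
clone_allowed_during_scan = []


def scan_all_pools() -> bool:
    # Client B trying to clone at this point would be refused.
    clone_allowed_during_scan.append(can_clone(snap))
    return False


print(unprotect(snap, scan_all_pools), snap.protection_state)
```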
131 | ||
132 | Resizing | |
133 | ^^^^^^^^ | |
134 | ||
135 | Resizing an rbd image is like truncating a sparse file. New space is | |
136 | treated as zeroes, and shrinking an rbd image deletes the contents | |
137 | beyond the old bounds. This means that if you have a 10G image full of | |
138 | data, and you resize it down to 5G and then up to 10G again, the last | |
139 | 5G is treated as zeroes (and any objects that held that data were | |
140 | removed when the image was shrunk). | |
141 | ||
142 | Layering complicates this because the absence of an object no longer | |
143 | implies it should be treated as zeroes - if the object is part of a | |
144 | clone, it may mean that some data needs to be read from the parent. | |
145 | ||
146 | To preserve the resizing behavior for clones, we need to keep track of | |
147 | which objects could be stored in the parent. We can track this as the | |
148 | amount of overlap the child has with the parent, since resizing only | |
149 | changes the end of an image. When a child is created, its overlap | |
150 | is the size of the parent snapshot. On each subsequent resize, the | |
151 | overlap is `min(overlap, new_size)`. That is, shrinking the image | |
152 | may shrink the overlap, but increasing the image's size does not | |
153 | change the overlap. | |
154 | ||
155 | Objects that do not exist past the overlap are treated as zeroes. | |
156 | Objects that do not exist before that point fall back to reading | |
157 | from the parent. | |
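As a worked example of the overlap rule, the sketch below replays the shrink-then-grow sequence from the start of this section on a clone. The helper names are invented for the illustration, and the fallback is decided per byte offset rather than per object to keep it short.

```python
GIB = 1 << 30


def resize_overlap(overlap: int, new_size: int) -> int:
    # Shrinking may reduce the overlap; growing never increases it.
    return min(overlap, new_size)


def missing_data_source(offset: int, overlap: int) -> str:
    # Where a read goes when the child has no object covering `offset`.
    return "parent" if offset < overlap else "zeroes"


overlap = 10 * GIB                           # clone of a 10G parent snapshot
overlap = resize_overlap(overlap, 5 * GIB)   # shrink the child to 5G
overlap = resize_overlap(overlap, 10 * GIB)  # grow it back to 10G
print(overlap // GIB)
print(missing_data_source(7 * GIB, overlap))
```

After the two resizes the overlap stays at 5G, so the regrown upper half of the child reads as zeroes instead of exposing the parent's data again.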
158 | ||
159 | Since this overlap changes over time, we store it as part of the | |
160 | metadata for a snapshot as well. | |
161 | ||
162 | Renaming | |
163 | ^^^^^^^^ | |
164 | ||
165 | Currently the rbd header object (that stores all the metadata about an | |
166 | image) is named after the name of the image. This makes renaming | |
167 | disrupt clients who have the image open (such as children reading from | |
168 | a parent). To avoid this, we can name the header object by the | |
169 | id of the image, which does not change. That is, the name of the | |
170 | header object could be `rbd_header.$id`, where $id is a unique id for | |
171 | the image in the pool. | |
172 | ||
173 | When a client opens an image, all it knows is the name. There is | |
174 | already a per-pool `rbd_directory` object that maps image names to | |
175 | ids, but if we relied on it to get the id, we could not open any | |
176 | images in that pool if that single object was unavailable. To avoid | |
177 | this dependency, we can store the id of an image in an object called | |
178 | `rbd_id.$image_name`, where $image_name is the name of the image. The | |
179 | per-pool `rbd_directory` object is still useful for listing all images | |
180 | in a pool, however. | |
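The two-step lookup can be sketched like this, with a dictionary standing in for the pool's objects. Only the object name patterns `rbd_id.$image_name` and `rbd_header.$id` come from the text; the functions and the sample id are hypothetical.

```python
from typing import Dict


def id_object_name(image_name: str) -> str:
    return f"rbd_id.{image_name}"


def header_object_name(image_id: str) -> str:
    return f"rbd_header.{image_id}"


def locate_header(pool: Dict[str, object], image_name: str) -> str:
    # Step 1: read the id from the per-image rbd_id object.
    image_id = str(pool[id_object_name(image_name)])
    # Step 2: the header object name depends only on the id.
    return header_object_name(image_id)


def rename(pool: Dict[str, object], old_name: str, new_name: str) -> None:
    # Only the small id object moves; the header object is untouched,
    # so clients that already have it open are not disturbed.
    pool[id_object_name(new_name)] = pool.pop(id_object_name(old_name))


pool = {"rbd_id.vm1": "abc123", "rbd_header.abc123": {"size": 10}}
before = locate_header(pool, "vm1")
rename(pool, "vm1", "vm2")
after = locate_header(pool, "vm2")
print(before, after)
```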
181 | ||
182 | Header changes | |
183 | -------------- | |
184 | ||
185 | The header needs a few new fields: | |
186 | ||
187 | * int64_t parent_pool_id | |
188 | * string parent_image_id | |
189 | * uint64_t parent_snap_id | |
190 | * uint64_t overlap (how much of the image may be referring to the parent) | |
191 | ||
192 | These are stored in a "parent" key, which is only present if the image | |
193 | has a parent. | |
194 | ||
195 | cls_rbd | |
196 | ^^^^^^^ | |
197 | ||
198 | Some new methods are needed: | |
199 | :: | |
200 | ||
201 | /***************** methods on the rbd header *********************/ | |
202 | /** | |
203 | * Sets the parent and overlap keys. | |
204 | * Fails if any of these keys exist, since the image already | |
205 | * had a parent. | |
206 | */ | |
207 | set_parent(uint64_t pool_id, string image_id, uint64_t snap_id) | |
208 | ||
209 | /** | |
210 | * returns the parent pool id, image id, snap id, and overlap, or -ENOENT | |
211 | * if parent_pool_id does not exist or is -1 | |
212 | */ | |
213 | get_parent(uint64_t snapid) | |
214 | ||
215 | /** | |
216 | * Removes the parent key | |
217 | */ | |
218 | remove_parent() // after all parent data is copied to the child | |
219 | ||
220 | /*************** methods on the rbd_children object *****************/ | |
221 | ||
222 | add_child(uint64_t parent_pool_id, string parent_image_id, | |
223 | uint64_t parent_snap_id, string image_id); | |
224 | remove_child(uint64_t parent_pool_id, string parent_image_id, | |
225 | uint64_t parent_snap_id, string image_id); | |
226 | /** | |
227 | * List the ids of the children of a given parent | |
228 | */ | |
229 | get_children(uint64_t parent_pool_id, string parent_image_id, | |
230 | uint64_t parent_snap_id, uint64_t max_return, | |
231 | string start); | |
232 | /** | |
233 | * List parents | |
234 | */ | |
235 | get_parents(uint64_t max_return, uint64_t start_pool_id, | |
236 | string start_image_id, string start_snap_id); | |
237 | ||
238 | ||
239 | /************ methods on the rbd_id.$image_name object **************/ | |
240 | ||
241 | set_id(string id) | |
242 | get_id() | |
243 | ||
244 | /************** methods on the rbd_directory object *****************/ | |
245 | ||
246 | dir_get_id(string name); | |
247 | dir_get_name(string id); | |
248 | dir_list(string start_after, uint64_t max_return); | |
249 | dir_add_image(string name, string id); | |
250 | dir_remove_image(string name, string id); | |
251 | dir_rename_image(string src, string dest, string id); | |
252 | ||
253 | Two existing methods will change if the image supports | |
254 | layering: | |
255 | :: | |
256 | ||
257 | snapshot_add - stores current overlap and has_parent with | |
258 | other snapshot metadata (images that don't have | |
259 | layering enabled aren't affected) | |
260 | ||
261 | set_size - will adjust the parent overlap down as needed. | |
262 | ||
263 | librbd | |
264 | ^^^^^^ | |
265 | ||
266 | Opening a child image opens its parent (and this will continue | |
267 | recursively as needed). This means that an ImageCtx will contain a | |
268 | pointer to the parent image context. Differing object sizes won't | |
269 | matter, since reading from the parent will go through the parent | |
270 | image context. | |
271 | ||
272 | Discard will need to change for layered images so that it only | |
273 | truncates objects, and does not remove them. If we removed objects, we | |
274 | could not tell if we needed to read them from the parent. | |
275 | ||
276 | A new clone method will be added, which takes the same arguments as | |
277 | create except size (size of the parent image is used). | |
278 | ||
279 | Instead of expanding the rbd_info struct, we will break the metadata | |
280 | retrieval into several API calls. Right now, the only users of | |
281 | rbd_stat() other than 'rbd info' only use it to retrieve image size. |