============
RBD Layering
============

RBD layering refers to the creation of copy-on-write clones of block
devices. This allows for fast image creation, for example to clone a
golden master image of a virtual machine into a new instance. To
simplify the semantics, you can only create a clone of a snapshot.
Snapshots are always read-only, so the rest of the image is
unaffected, and there's no possibility of writing to them
accidentally.

From a user's perspective, a clone is just like any other rbd image.
You can take snapshots of it, read from and write to it, resize it,
etc. There are no restrictions on clones from a user's viewpoint.

Note: the terms `child` and `parent` below mean, respectively, an rbd
image created by cloning and the rbd image snapshot a child was cloned
from.

Command line interface
----------------------

Before cloning a snapshot, you must mark it as protected, to prevent
it from being deleted while child images refer to it:
::

    $ rbd snap protect pool/image@snap

Then you can perform the clone:
::

    $ rbd clone [--parent] pool/parent@snap [--image] pool2/child1

You can create a clone with different object sizes from the parent:
::

    $ rbd clone --order 25 pool/parent@snap pool2/child2

To delete the parent, you must first mark it unprotected, which checks
that there are no children left:
::

    $ rbd snap unprotect pool/image@snap
    Cannot unprotect: Still in use by pool2/image2
    $ rbd children pool/image@snap
    pool2/child1
    pool2/child2
    $ rbd flatten pool2/child1
    $ rbd rm pool2/child2
    $ rbd snap rm pool/image@snap
    Cannot remove a protected snapshot: pool/image@snap
    $ rbd snap unprotect pool/image@snap

Then the snapshot can be deleted like normal:
::

    $ rbd snap rm pool/image@snap

Implementation
--------------

Data Flow
^^^^^^^^^

In the initial implementation, called 'trivial layering', there will
be no tracking of which objects exist in a clone. A read that hits a
non-existent object will attempt to read from the parent snapshot, and
this will continue recursively until an object exists or an image with
no parent is found. This is done through the normal read path from
the parent, so differing object sizes between parents and children
do not matter.
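
The read fallback described above can be illustrated with a small toy
model. This is only a sketch under stated assumptions: the `Image` class,
its in-memory `objects` map, and the `read` method are hypothetical and are
not part of librbd. It ignores snapshots and the parent overlap, and only
shows why reading through the parent's own read path makes differing object
sizes irrelevant.

```python
from typing import Dict, Optional


class Image:
    """Toy image (hypothetical): fixed-size objects held in a dict."""

    def __init__(self, object_size: int, parent: Optional["Image"] = None):
        self.object_size = object_size
        self.parent = parent                  # stands in for the parent snapshot
        self.objects: Dict[int, bytes] = {}   # object number -> object data

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        end = offset + length
        while offset < end:
            # Split the request along this image's own object boundaries.
            objno, obj_off = divmod(offset, self.object_size)
            n = min(self.object_size - obj_off, end - offset)
            obj = self.objects.get(objno)
            if obj is not None:
                # Object exists: serve it, padding short objects with zeroes.
                chunk = obj[obj_off:obj_off + n].ljust(n, b"\0")
            elif self.parent is not None:
                # Missing object: re-issue the same byte range to the parent,
                # which splits it along its own object boundaries.
                chunk = self.parent.read(offset, n)
            else:
                # No parent left: sparse space reads as zeroes.
                chunk = bytes(n)
            out += chunk
            offset += n
        return bytes(out)
```

For example, a child with 8-byte objects can sit on top of a parent with
4-byte objects; a read that spans two parent objects is still assembled
correctly because the parent handles its own layout.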

Before a write to an object is performed, the object is checked for
existence. If it doesn't exist, a copy-up operation is performed,
which means reading the relevant range of data from the parent
snapshot and writing it (plus the original write) to the child
image. To prevent races with multiple writes trying to copy-up the
same object, this copy-up operation will include an atomic create. If
the atomic create fails, the original write is done instead. This
copy-up operation is implemented as a class method so that extra
metadata can be stored by it in the future. In trivial layering, the
copy-up operation copies the entire range needed to the child object
(that is, the full size of the child object). A future optimization
could make this copy-up more fine-grained.
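
The copy-up write path can be sketched as follows. This is a toy model, not
the actual class method: `ObjectStore`, `create_exclusive`, and
`write_with_copy_up` are hypothetical names, and the `read_parent` callback
is an assumed stand-in for reading the object's byte range from the parent
snapshot. The sketch assumes the write fits inside one object.

```python
from typing import Callable, Dict


class ObjectStore:
    """Toy object store (hypothetical) with an atomic exclusive create."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytearray] = {}

    def create_exclusive(self, name: str, data: bytes) -> bool:
        # Succeeds only if the object does not exist yet.
        if name in self.objects:
            return False
        self.objects[name] = bytearray(data)
        return True

    def write(self, name: str, offset: int, data: bytes) -> None:
        obj = self.objects[name]
        if len(obj) < offset + len(data):
            obj.extend(bytes(offset + len(data) - len(obj)))
        obj[offset:offset + len(data)] = data


def write_with_copy_up(store: ObjectStore, name: str, offset: int,
                       data: bytes, object_size: int,
                       read_parent: Callable[[int], bytes]) -> str:
    if name not in store.objects:
        # Copy-up: full object range from the parent, plus the new write.
        merged = bytearray(read_parent(object_size).ljust(object_size, b"\0"))
        merged[offset:offset + len(data)] = data
        if store.create_exclusive(name, bytes(merged)):
            return "copy-up"
        # Someone else created the object first; fall through.
    store.write(name, offset, data)
    return "plain-write"
```

If two writers race, only one exclusive create succeeds; the loser simply
applies its original write on top of the object the winner copied up.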

Another future optimization could be storing a bitmap of which objects
actually exist in a child. This would obviate the check for existence
before each write, and let reads go directly to the parent if needed.

These optimizations are discussed in:

http://marc.info/?l=ceph-devel&m=129867273303846

Parent/Child relationships
^^^^^^^^^^^^^^^^^^^^^^^^^^

Children store a reference to their parent in their header, as a tuple
of (pool id, image id, snapshot id). This is enough information to
open the parent and read from it.

In addition to knowing which parent a given image has, we want to be
able to tell if a protected snapshot still has children. This is
accomplished with a new per-pool object, `rbd_children`, which maps
(parent pool id, parent image id, parent snapshot id) to a list of
child image ids. This is stored in the same pool as the child image
because the client creating a clone already has read/write access to
everything in this pool, but may not have write access to the parent's
pool. This lets a client with read-only access to one pool clone a
snapshot from that pool into a pool they have full access to. It
increases the cost of unprotecting a snapshot, since this needs to
check for children in every pool, but this is a rare operation. It
would likely only be done before removing old images, which is already
much more expensive because it involves deleting every data object in
the image.
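
To make the cost trade-off concrete, here is a minimal sketch of the
per-pool `rbd_children` mapping. The `Pool` class and the `add_child` and
`find_children` helpers are hypothetical in-memory stand-ins, not the real
object classes; they only show that registering a clone touches the child's
pool alone, while checking a snapshot for children has to visit every pool.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (parent pool id, parent image id, parent snapshot id)
ParentSpec = Tuple[int, str, int]


class Pool:
    """Toy pool (hypothetical) holding its own rbd_children mapping."""

    def __init__(self, pool_id: int) -> None:
        self.pool_id = pool_id
        self.rbd_children: Dict[ParentSpec, Set[str]] = defaultdict(set)


def add_child(child_pool: Pool, parent: ParentSpec, child_image_id: str) -> None:
    # Only the child's pool is written; the parent's pool is untouched.
    child_pool.rbd_children[parent].add(child_image_id)


def find_children(pools: Iterable[Pool], parent: ParentSpec) -> List[Tuple[int, str]]:
    # Unprotect-style check: every pool must be scanned.
    found = []
    for pool in pools:
        for image_id in sorted(pool.rbd_children.get(parent, ())):
            found.append((pool.pool_id, image_id))
    return found
```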

Protection
^^^^^^^^^^

Internally, protection_state is a field in the header object that
can be in one of three states: "protected", "unprotected", and
"unprotecting". The first two are set as the result of "rbd snap
protect/unprotect". The "unprotecting" state is set while the "rbd
snap unprotect" command checks for any child images. Only snapshots in
the "protected" state may be cloned, so the "unprotecting" state
prevents a race like:

1. A: walk through all pools, look for clones, find none
2. B: create a clone
3. A: unprotect parent
4. A: rbd snap rm pool/parent@snap
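
The way the intermediate state closes this race can be sketched with a small
state machine. The `Snapshot` class and its methods are hypothetical, and two
details are assumptions not stated above: that a failed unprotect returns the
snapshot to "protected", and that the children list is local (in the real
design it is spread over every pool's `rbd_children` object). The
`during_scan` hook simulates client B acting while client A is scanning.

```python
from enum import Enum
from typing import Callable, List, Optional


class ProtectionState(Enum):
    UNPROTECTED = "unprotected"
    PROTECTED = "protected"
    UNPROTECTING = "unprotecting"


class Snapshot:
    """Toy snapshot (hypothetical) tracking its protection state."""

    def __init__(self) -> None:
        self.state = ProtectionState.UNPROTECTED
        self.children: List[str] = []

    def protect(self) -> None:
        self.state = ProtectionState.PROTECTED

    def clone(self, child_name: str) -> bool:
        # Cloning is only allowed from a fully protected snapshot.
        if self.state is not ProtectionState.PROTECTED:
            return False
        self.children.append(child_name)
        return True

    def unprotect(self, during_scan: Optional[Callable[[], None]] = None) -> bool:
        if self.state is not ProtectionState.PROTECTED:
            return False
        # Block new clones before looking for existing ones.
        self.state = ProtectionState.UNPROTECTING
        if during_scan is not None:
            during_scan()  # a concurrent client acts here
        if self.children:
            # Assumption: go back to "protected" when children are found.
            self.state = ProtectionState.PROTECTED
            return False
        self.state = ProtectionState.UNPROTECTED
        return True
```

A clone attempted during the scan is refused, so the scan result cannot be
invalidated before the snapshot becomes unprotected.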

Resizing
^^^^^^^^

Resizing an rbd image is like truncating a sparse file. New space is
treated as zeroes, and shrinking an rbd image deletes the contents
beyond the new bounds. This means that if you have a 10G image full of
data, and you resize it down to 5G and then up to 10G again, the last
5G is treated as zeroes (and any objects that held that data were
removed when the image was shrunk).

Layering complicates this because the absence of an object no longer
implies it should be treated as zeroes - if the object is part of a
clone, it may mean that some data needs to be read from the parent.

To preserve the resizing behavior for clones, we need to keep track of
which objects could be stored in the parent. We can track this as the
amount of overlap the child has with the parent, since resizing only
changes the end of an image. When a child is created, its overlap
is the size of the parent snapshot. On each subsequent resize, the
overlap is `min(overlap, new_size)`. That is, shrinking the image
may shrink the overlap, but increasing the image's size does not
change the overlap.

Objects that do not exist past the overlap are treated as zeroes.
Objects that do not exist before that point fall back to reading
from the parent.
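
The overlap rule is simple enough to work through directly. The two helper
names below are hypothetical, and the second one is simplified: it decides
by the starting offset of a missing object and does not handle an object
that straddles the overlap boundary.

```python
GiB = 2 ** 30


def overlap_after_resize(overlap: int, new_size: int) -> int:
    # Shrinking may shrink the overlap; growing never restores it.
    return min(overlap, new_size)


def missing_object_source(object_start: int, overlap: int) -> str:
    # Where a read of a non-existent child object gets its data from.
    return "parent" if object_start < overlap else "zeroes"


# Clone of a 10G parent snapshot: the overlap starts at the parent size.
overlap = 10 * GiB
overlap = overlap_after_resize(overlap, 5 * GiB)    # shrink to 5G
overlap = overlap_after_resize(overlap, 10 * GiB)   # grow back to 10G
```

After the shrink and regrow, the overlap stays at 5G, so a missing object at
offset 7G reads as zeroes while a missing object at offset 2G is still read
from the parent.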

Since this overlap changes over time, we store it as part of the
metadata for a snapshot as well.

Renaming
^^^^^^^^

Currently the rbd header object (that stores all the metadata about an
image) is named after the name of the image. This makes renaming
disrupt clients who have the image open (such as children reading from
a parent). To avoid this, we can name the header object by the
id of the image, which does not change. That is, the name of the
header object could be `rbd_header.$id`, where $id is a unique id for
the image in the pool.

When a client opens an image, all it knows is the name. There is
already a per-pool `rbd_directory` object that maps image names to
ids, but if we relied on it to get the id, we could not open any
images in that pool if that single object was unavailable. To avoid
this dependency, we can store the id of an image in an object called
`rbd_id.$image_name`, where $image_name is the name of the image. The
per-pool `rbd_directory` object is still useful for listing all images
in a pool, however.
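
The name-to-header lookup can be sketched with a pool modeled as a plain
dict of object name to contents. The helper functions are hypothetical, and
the exact steps in `rename` are an assumption made for illustration; the
point is that a rename only touches the `rbd_id.*` and `rbd_directory`
objects, while the header object keeps its id-based name.

```python
from typing import Any, Dict


def id_object_name(image_name: str) -> str:
    return f"rbd_id.{image_name}"


def header_object_name(image_id: str) -> str:
    return f"rbd_header.{image_id}"


def open_header(pool: Dict[str, Any], image_name: str) -> str:
    # Resolve name -> id through the small per-image id object,
    # without consulting the per-pool rbd_directory object.
    image_id = pool[id_object_name(image_name)]
    return header_object_name(image_id)


def rename(pool: Dict[str, Any], src: str, dest: str) -> None:
    # Assumed steps: move the id object and update the directory entry.
    image_id = pool.pop(id_object_name(src))
    pool[id_object_name(dest)] = image_id
    directory = pool["rbd_directory"]
    del directory[src]
    directory[dest] = image_id
```

A client that already resolved the header object name before the rename
keeps using the same object afterwards.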

Header changes
--------------

The header needs a few new fields:

* int64_t parent_pool_id
* string parent_image_id
* uint64_t parent_snap_id
* uint64_t overlap (how much of the image may be referring to the parent)

These are stored in a "parent" key, which is only present if the image
has a parent.
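
As a rough illustration of a key that exists only for images with a parent,
the header can be modeled as a dict. The functions below are toy stand-ins:
passing the overlap explicitly and raising `EEXIST` when a parent is already
set are assumptions made for this sketch, not documented behavior.

```python
import errno
from typing import Any, Dict, Tuple

# (parent pool id, parent image id, parent snapshot id, overlap)
ParentInfo = Tuple[int, str, int, int]


def set_parent(header: Dict[str, Any], pool_id: int, image_id: str,
               snap_id: int, overlap: int) -> None:
    # Refuse to overwrite an existing parent (error code is an assumption).
    if "parent" in header:
        raise OSError(errno.EEXIST, "image already has a parent")
    header["parent"] = (pool_id, image_id, snap_id, overlap)


def get_parent(header: Dict[str, Any]) -> ParentInfo:
    # The key is absent for images without a parent.
    if "parent" not in header:
        raise OSError(errno.ENOENT, "image has no parent")
    return header["parent"]


def remove_parent(header: Dict[str, Any]) -> None:
    # Used once all parent data has been copied to the child.
    header.pop("parent", None)
```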

cls_rbd
^^^^^^^

Some new methods are needed:
::

    /***************** methods on the rbd header *********************/
    /**
     * Sets the parent and overlap keys.
     * Fails if any of these keys exist, since the image already
     * had a parent.
     */
    set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)

    /**
     * returns the parent pool id, image id, snap id, and overlap, or -ENOENT
     * if parent_pool_id does not exist or is -1
     */
    get_parent(uint64_t snapid)

    /**
     * Removes the parent key
     */
    remove_parent() // after all parent data is copied to the child

    /*************** methods on the rbd_children object *****************/

    add_child(uint64_t parent_pool_id, string parent_image_id,
              uint64_t parent_snap_id, string image_id);
    remove_child(uint64_t parent_pool_id, string parent_image_id,
                 uint64_t parent_snap_id, string image_id);
    /**
     * List ids of a given parent
     */
    get_children(uint64_t parent_pool_id, string parent_image_id,
                 uint64_t parent_snap_id, uint64_t max_return,
                 string start);
    /**
     * list parent
     */
    get_parents(uint64_t max_return, uint64_t start_pool_id,
                string start_image_id, string start_snap_id);


    /************ methods on the rbd_id.$image_name object **************/

    set_id(string id)
    get_id()

    /************** methods on the rbd_directory object *****************/

    dir_get_id(string name);
    dir_get_name(string id);
    dir_list(string start_after, uint64_t max_return);
    dir_add_image(string name, string id);
    dir_remove_image(string name, string id);
    dir_rename_image(string src, string dest, string id);

Two existing methods will change if the image supports
layering:
::

    snapshot_add - stores current overlap and has_parent with
                   other snapshot metadata (images that don't have
                   layering enabled aren't affected)

    set_size     - will adjust the parent overlap down as needed.

librbd
^^^^^^

Opening a child image opens its parent (and this will continue
recursively as needed). This means that an ImageCtx will contain a
pointer to the parent image context. Differing object sizes won't
matter, since reading from the parent will go through the parent
image context.

Discard will need to change for layered images so that it only
truncates objects, and does not remove them. If we removed objects, we
could not tell if we needed to read them from the parent.
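
The reason discard must truncate rather than remove can be shown with a toy
object map. Both functions are hypothetical: `discard_object` empties the
object when the image has a parent and deletes it otherwise, and
`read_object` treats a missing object as a signal to use the parent's data
(passed in directly here for simplicity).

```python
from typing import Dict


def discard_object(objects: Dict[int, bytes], objno: int, has_parent: bool) -> None:
    if objno not in objects:
        return
    if has_parent:
        # Truncate: the object still exists, so reads stay in the child.
        objects[objno] = b""
    else:
        # No parent to fall back to, so removal is safe.
        del objects[objno]


def read_object(objects: Dict[int, bytes], objno: int, size: int,
                parent_data: bytes) -> bytes:
    obj = objects.get(objno)
    if obj is None:
        # Missing object: fall back to the parent's data.
        return parent_data[:size].ljust(size, b"\0")
    # Existing (possibly truncated) object: missing bytes are zeroes.
    return obj[:size].ljust(size, b"\0")
```

With truncation, discarded data reads back as zeroes. Had the object been
removed instead, the same read would have returned the parent's old data.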

A new clone method will be added, which takes the same arguments as
create except size (size of the parent image is used).

Instead of expanding the rbd_info struct, we will break the metadata
retrieval into several API calls. Right now, the only users of
rbd_stat() other than 'rbd info' only use it to retrieve image size.