
============
RBD Layering
============

RBD layering refers to the creation of copy-on-write clones of block
devices. This allows for fast image creation, for example to clone a
golden master image of a virtual machine into a new instance. To
simplify the semantics, you can only create a clone of a snapshot.
Snapshots are always read-only, so the rest of the image is
unaffected, and there's no possibility of writing to them
accidentally.

From a user's perspective, a clone is just like any other rbd image.
You can take snapshots of it, read from and write to it, resize it,
and so on. There are no restrictions on clones from a user's viewpoint.

Note: the terms `child` and `parent` below mean, respectively, an rbd
image created by cloning and the rbd image snapshot that child was
cloned from.

Command line interface
----------------------

Before cloning a snapshot, you must mark it as protected, to prevent
it from being deleted while child images refer to it:
::

    $ rbd snap protect pool/image@snap

Then you can perform the clone:
::

    $ rbd clone [--parent] pool/parent@snap [--image] pool2/child1

You can create a clone with different object sizes from the parent:
::

    $ rbd clone --order 25 pool/parent@snap pool2/child2

To delete the parent, you must first mark it unprotected, which checks
that there are no children left:
::

    $ rbd snap unprotect pool/image@snap
    Cannot unprotect: Still in use by pool2/child1
    $ rbd children pool/image@snap
    pool2/child1
    pool2/child2
    $ rbd flatten pool2/child1
    $ rbd rm pool2/child2
    $ rbd snap rm pool/image@snap
    Cannot remove a protected snapshot: pool/image@snap
    $ rbd snap unprotect pool/image@snap

Then the snapshot can be deleted like normal:
::

    $ rbd snap rm pool/image@snap

Implementation
--------------

Data Flow
^^^^^^^^^

In the initial implementation, called 'trivial layering', there will
be no tracking of which objects exist in a clone. A read that hits a
non-existent object will attempt to read from the parent snapshot, and
this will continue recursively until an object exists or an image with
no parent is found. This is done through the normal read path from
the parent, so differing object sizes between parents and children
do not matter.
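
The recursive read fallback can be modelled in a few lines. This is only a toy sketch, not librbd code: the `Image` class, `read_object`, and the tiny `OBJECT_SIZE` are hypothetical names, and the model assumes parent and child share one object size and ignores the overlap handling described under Resizing.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

OBJECT_SIZE = 4  # hypothetical object size, tiny on purpose


@dataclass
class Image:
    """Toy image: a sparse map of object number -> object data."""
    objects: Dict[int, bytes] = field(default_factory=dict)
    parent: Optional["Image"] = None  # stands in for the parent snapshot


def read_object(image: Image, objno: int) -> bytes:
    """Walk up the parent chain until some image holds the object."""
    current: Optional[Image] = image
    while current is not None:
        if objno in current.objects:
            return current.objects[objno]
        current = current.parent
    # An image with no parent was reached: the object reads as zeroes.
    return b"\0" * OBJECT_SIZE
```

A clone of a clone therefore sees data from any ancestor that still holds the object, and zeroes when none does.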

Before a write to an object is performed, the object is checked for
existence. If it doesn't exist, a copy-up operation is performed,
which means reading the relevant range of data from the parent
snapshot and writing it (plus the original write) to the child
image. To prevent races with multiple writes trying to copy-up the
same object, this copy-up operation will include an atomic create. If
the atomic create fails, the original write is done instead. This
copy-up operation is implemented as a class method so that extra
metadata can be stored by it in the future. In trivial layering, the
copy-up operation copies the entire range needed to the child object
(that is, the full size of the child object). A future optimization
could make this copy-up more fine-grained.
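
As an illustration of the write path, the sketch below models copy-up with plain dictionaries. It is a simplification under stated assumptions: `write_with_copy_up` and `OBJECT_SIZE` are hypothetical names, and the dictionary membership test merely stands in for the atomic create that the real class method would perform on the OSD.

```python
from typing import Dict

OBJECT_SIZE = 8  # hypothetical object size, tiny on purpose


def write_with_copy_up(child: Dict[int, bytearray], parent: Dict[int, bytes],
                       objno: int, offset: int, data: bytes) -> bool:
    """Write data into a child object, copying up from the parent if needed.

    Returns True if this call performed the copy-up.
    """
    copied = False
    if objno not in child:
        # Copy-up: seed the child object with the parent's data, padded
        # with zeroes to the full size of the child object.
        base = parent.get(objno, b"")
        child[objno] = bytearray(base.ljust(OBJECT_SIZE, b"\0"))
        copied = True
    # Whether or not a copy-up happened, apply the original write.
    child[objno][offset:offset + len(data)] = data
    return copied
```

A second write to the same object finds it already present, which corresponds to the case where the atomic create fails and only the original write is done. The parent data is never modified.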

Another future optimization could be storing a bitmap of which objects
actually exist in a child. This would obviate the check for existence
before each write, and let reads go directly to the parent if needed.

These optimizations are discussed in:

http://marc.info/?l=ceph-devel&m=129867273303846

Parent/Child relationships
^^^^^^^^^^^^^^^^^^^^^^^^^^

Children store a reference to their parent in their header, as a tuple
of (pool id, image id, snapshot id). This is enough information to
open the parent and read from it.

In addition to knowing which parent a given image has, we want to be
able to tell if a protected snapshot still has children. This is
accomplished with a new per-pool object, `rbd_children`, which maps
(parent pool id, parent image id, parent snapshot id) to a list of
child image ids. This is stored in the same pool as the child image
because the client creating a clone already has read/write access to
everything in this pool, but may not have write access to the parent's
pool. This lets a client with read-only access to one pool clone a
snapshot from that pool into a pool they have full access to. It
increases the cost of unprotecting a snapshot, since this needs to check
for children in every pool, but this is a rare operation. It would
likely only be done before removing old images, which is already much
more expensive because it involves deleting every data object in the
image.
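
The per-pool index and the cross-pool scan can be sketched as follows. This is a toy model, assuming each pool's `rbd_children` object behaves like a dictionary; `add_child` and `find_children` here are illustrative Python helpers, not the cls_rbd methods themselves.

```python
from typing import Dict, List, Set, Tuple

ParentSpec = Tuple[int, str, int]  # (pool id, image id, snapshot id)
ChildrenIndex = Dict[ParentSpec, Set[str]]  # one rbd_children-like map per pool


def add_child(pools: Dict[int, ChildrenIndex], child_pool_id: int,
              parent: ParentSpec, child_image_id: str) -> None:
    """Record the child in the index of the pool the child lives in."""
    index = pools.setdefault(child_pool_id, {})
    index.setdefault(parent, set()).add(child_image_id)


def find_children(pools: Dict[int, ChildrenIndex],
                  parent: ParentSpec) -> List[Tuple[int, str]]:
    """Scan every pool's index, as unprotecting a snapshot has to."""
    found = []
    for pool_id in sorted(pools):
        for image_id in sorted(pools[pool_id].get(parent, set())):
            found.append((pool_id, image_id))
    return found
```

Creating a clone only touches the child's pool, while the search for children is what has to visit every pool.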

Protection
^^^^^^^^^^

Internally, protection_state is a field in the header object that
can be in one of three states: "protected", "unprotected", and
"unprotecting". The first two are set as the result of "rbd snap
protect" and "rbd snap unprotect". The "unprotecting" state is set
while the "rbd snap unprotect" command checks for any child images.
Only snapshots in the "protected" state may be cloned, so the
"unprotecting" state prevents a race like:

1. A: walk through all pools, look for clones, find none
2. B: create a clone
3. A: unprotect parent
4. A: rbd snap rm pool/parent@snap
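
A minimal state-machine sketch shows why the intermediate state closes this race. The `Snapshot` class and its method names are hypothetical, and rolling back to "protected" when children are found is an assumption of this sketch rather than something stated above.

```python
from enum import Enum


class ProtectionState(Enum):
    UNPROTECTED = "unprotected"
    PROTECTED = "protected"
    UNPROTECTING = "unprotecting"


class Snapshot:
    def __init__(self) -> None:
        self.state = ProtectionState.UNPROTECTED
        self.children = set()

    def protect(self) -> None:
        self.state = ProtectionState.PROTECTED

    def clone(self, child_name: str) -> None:
        # Only snapshots in the "protected" state may be cloned.
        if self.state is not ProtectionState.PROTECTED:
            raise PermissionError("snapshot is not protected")
        self.children.add(child_name)

    def begin_unprotect(self) -> None:
        if self.state is not ProtectionState.PROTECTED:
            raise ValueError("snapshot is not protected")
        # From here on, new clones are refused while children are searched.
        self.state = ProtectionState.UNPROTECTING

    def finish_unprotect(self) -> bool:
        if self.children:
            # Assumed behaviour: go back to "protected" if a child exists.
            self.state = ProtectionState.PROTECTED
            return False
        self.state = ProtectionState.UNPROTECTED
        return True
```

Client B's clone in step 2 would be rejected because client A has already moved the snapshot to "unprotecting" before starting its search.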

Resizing
^^^^^^^^

Resizing an rbd image is like truncating a sparse file. New space is
treated as zeroes, and shrinking an rbd image deletes the contents
beyond the old bounds. This means that if you have a 10G image full of
data, and you resize it down to 5G and then up to 10G again, the last
5G is treated as zeroes (and any objects that held that data were
removed when the image was shrunk).

Layering complicates this because the absence of an object no longer
implies it should be treated as zeroes - if the object is part of a
clone, it may mean that some data needs to be read from the parent.

To preserve the resizing behavior for clones, we need to keep track of
which objects could be stored in the parent. We can track this as the
amount of overlap the child has with the parent, since resizing only
changes the end of an image. When a child is created, its overlap
is the size of the parent snapshot. On each subsequent resize, the
overlap is `min(overlap, new_size)`. That is, shrinking the image
may shrink the overlap, but increasing the image's size does not
change the overlap.

Reads of objects that do not exist and lie past the overlap are
treated as zeroes. Reads of objects that do not exist and lie before
that point fall back to reading from the parent.

Since this overlap changes over time, we store it as part of the
metadata for a snapshot as well.
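
The overlap rule can be checked with a short worked example. The function names below are hypothetical, and the sketch works on byte offsets only, ignoring objects that straddle the overlap boundary.

```python
GIB = 2 ** 30


def resize_overlap(overlap: int, new_size: int) -> int:
    """Overlap after a resize: it can shrink but never grow."""
    return min(overlap, new_size)


def read_source(offset: int, overlap: int, object_exists: bool) -> str:
    """Decide where a read at the given image offset is served from."""
    if object_exists:
        return "child"
    return "parent" if offset < overlap else "zeroes"


# A clone of a 10G parent snapshot, shrunk to 5G and grown back to 10G.
overlap = 10 * GIB
overlap = resize_overlap(overlap, 5 * GIB)
overlap = resize_overlap(overlap, 10 * GIB)
```

After these two resizes the overlap stays at 5G, so a missing object at 7G reads as zeroes while a missing object at 2G still falls back to the parent.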

Renaming
^^^^^^^^

Currently the rbd header object (that stores all the metadata about an
image) is named after the name of the image. This makes renaming
disrupt clients who have the image open (such as children reading from
a parent). To avoid this, we can name the header object by the
id of the image, which does not change. That is, the name of the
header object could be `rbd_header.$id`, where $id is a unique id for
the image in the pool.

When a client opens an image, all it knows is the name. There is
already a per-pool `rbd_directory` object that maps image names to
ids, but if we relied on it to get the id, we could not open any
images in that pool if that single object was unavailable. To avoid
this dependency, we can store the id of an image in an object called
`rbd_id.$image_name`, where $image_name is the name of the image. The
per-pool `rbd_directory` object is still useful for listing all images
in a pool, however.
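
The two-step lookup can be sketched with a dictionary standing in for a pool. The helper functions are hypothetical, the id value used in the usage note is made up, and a real rename would also have to update `rbd_directory`, which this toy model leaves out.

```python
from typing import Dict


def id_object_name(image_name: str) -> str:
    return f"rbd_id.{image_name}"


def header_object_name(image_id: str) -> str:
    return f"rbd_header.{image_id}"


def open_header(pool: Dict[str, str], image_name: str) -> str:
    """Resolve an image name to the name of its header object."""
    image_id = pool[id_object_name(image_name)]
    return header_object_name(image_id)


def rename(pool: Dict[str, str], old_name: str, new_name: str) -> None:
    """Renaming moves only the small id object; the header keeps its name."""
    pool[id_object_name(new_name)] = pool.pop(id_object_name(old_name))
```

For example, with `pool = {"rbd_id.foo": "abc123"}`, opening `foo` resolves to `rbd_header.abc123`, and after `rename(pool, "foo", "bar")` opening `bar` resolves to the same header object, so clients that already hold the header open are not disturbed.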

Header changes
--------------

The header needs a few new fields:

* int64_t parent_pool_id
* string parent_image_id
* uint64_t parent_snap_id
* uint64_t overlap (how much of the image may be referring to the parent)

These are stored in a "parent" key, which is only present if the image
has a parent.

cls_rbd
^^^^^^^

Some new methods are needed:
::

    /***************** methods on the rbd header *********************/
    /**
     * Sets the parent and overlap keys.
     * Fails if any of these keys exist, since the image already
     * had a parent.
     */
    set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)

    /**
     * returns the parent pool id, image id, snap id, and overlap, or -ENOENT
     * if parent_pool_id does not exist or is -1
     */
    get_parent(uint64_t snapid)

    /**
     * Removes the parent key
     */
    remove_parent() // after all parent data is copied to the child

    /*************** methods on the rbd_children object *****************/

    add_child(uint64_t parent_pool_id, string parent_image_id,
              uint64_t parent_snap_id, string image_id);
    remove_child(uint64_t parent_pool_id, string parent_image_id,
                 uint64_t parent_snap_id, string image_id);
    /**
     * List child image ids of a given parent snapshot
     */
    get_children(uint64_t parent_pool_id, string parent_image_id,
                 uint64_t parent_snap_id, uint64_t max_return,
                 string start);
    /**
     * List parents that have children in this pool
     */
    get_parents(uint64_t max_return, uint64_t start_pool_id,
                string start_image_id, string start_snap_id);


    /************ methods on the rbd_id.$image_name object **************/

    set_id(string id)
    get_id()

    /************** methods on the rbd_directory object *****************/

    dir_get_id(string name);
    dir_get_name(string id);
    dir_list(string start_after, uint64_t max_return);
    dir_add_image(string name, string id);
    dir_remove_image(string name, string id);
    dir_rename_image(string src, string dest, string id);

Two existing methods will change if the image supports
layering:
::

    snapshot_add - stores current overlap and has_parent with
                   other snapshot metadata (images that don't have
                   layering enabled aren't affected)

    set_size     - will adjust the parent overlap down as needed.
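
The semantics of the header methods can be illustrated with an in-memory stand-in. This is a sketch under assumptions, not the cls_rbd implementation: the `Header` class is hypothetical, `set_parent` takes the overlap as an explicit argument for simplicity, `-EEXIST` is an assumed error code for the "already has a parent" case, and the per-snapshot `snapid` argument of `get_parent` is omitted.

```python
import errno
from typing import Dict, Tuple, Union

ParentInfo = Tuple[int, str, int, int]  # pool id, image id, snap id, overlap


class Header:
    def __init__(self, size: int) -> None:
        self.size = size
        # The "parent" key is only present if the image has a parent.
        self.keys: Dict[str, ParentInfo] = {}

    def set_parent(self, pool_id: int, image_id: str, snap_id: int,
                   overlap: int) -> int:
        if "parent" in self.keys:
            return -errno.EEXIST  # assumed error code
        self.keys["parent"] = (pool_id, image_id, snap_id, overlap)
        return 0

    def get_parent(self) -> Union[ParentInfo, int]:
        return self.keys.get("parent", -errno.ENOENT)

    def remove_parent(self) -> None:
        self.keys.pop("parent", None)

    def set_size(self, new_size: int) -> None:
        self.size = new_size
        if "parent" in self.keys:
            pool_id, image_id, snap_id, overlap = self.keys["parent"]
            # Adjust the parent overlap down as needed, never up.
            self.keys["parent"] = (pool_id, image_id, snap_id,
                                   min(overlap, new_size))
```

Calling `remove_parent()` after a flatten leaves the header looking like that of an image that never had a parent.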

librbd
^^^^^^

Opening a child image opens its parent (and this will continue
recursively as needed). This means that an ImageCtx will contain a
pointer to the parent image context. Differing object sizes won't
matter, since reading from the parent will go through the parent
image context.

Discard will need to change for layered images so that it only
truncates objects, and does not remove them. If we removed objects, we
could not tell if we needed to read them from the parent.
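
The difference between truncating and removing can be seen in a small model. The functions below are hypothetical and only cover discarding an object that already exists in the child.

```python
from typing import Dict


def discard_object(child: Dict[int, bytes], objno: int, layered: bool) -> None:
    """Discard a whole object that exists in the child image."""
    if objno not in child:
        return
    if layered:
        # Truncate: the (now empty) object still exists and masks the parent.
        child[objno] = b""
    else:
        del child[objno]


def read_object(child: Dict[int, bytes], parent: Dict[int, bytes],
                objno: int, size: int) -> bytes:
    """Read from the child if the object exists, else fall back to the parent."""
    if objno in child:
        return child[objno].ljust(size, b"\0")
    return parent.get(objno, b"").ljust(size, b"\0")
```

With truncation, a read after the discard returns zeroes. If the object were removed from a clone instead, the same read would fall through to the parent and return stale parent data.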

A new clone method will be added, which takes the same arguments as
create except size (size of the parent image is used).

Instead of expanding the rbd_info struct, we will break the metadata
retrieval into several API calls.  Right now, the only users of
rbd_stat() other than 'rbd info' only use it to retrieve image size.