CephFS Snapshots
================

CephFS supports snapshots, generally created by invoking mkdir against the
(hidden, special) .snap directory.
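
For example, from a client mount point a snapshot is created with an ordinary
mkdir(2) call. A minimal sketch, assuming a mount at "/mnt/cephfs" and an
illustrative path and snapshot name::

   // Sketch: snapshot the subtree rooted at /mnt/cephfs/some/dir.
   // The mount point, path, and snapshot name are illustrative.
   #include <sys/stat.h>
   #include <cstdio>

   int main() {
     // mkdir inside the special .snap directory creates the snapshot
     if (mkdir("/mnt/cephfs/some/dir/.snap/mysnap", 0755) != 0)
       perror("mkdir");
     return 0;
   }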

Overview
--------

Generally, snapshots do what they sound like: they create an immutable view
of the filesystem at the point in time they're taken. There are some headline
features that make CephFS snapshots different from what you might expect:

* Arbitrary subtrees. Snapshots are created within any directory you choose,
  and cover all data in the filesystem under that directory.
* Asynchronous. If you create a snapshot, buffered data is flushed out lazily,
  including from other clients. As a result, "creating" the snapshot is
  very fast.

Important Data Structures
-------------------------
* SnapRealm: A `SnapRealm` is created whenever you create a snapshot at a new
  point in the hierarchy (or, when a snapshotted inode is moved outside of its
  parent snapshot). SnapRealms contain an `sr_t srnode`, links to `past_parents`
  and `past_children`, and all `inodes_with_caps` that are part of the snapshot.
  Clients also have a SnapRealm concept that maintains less data but is used to
  associate a `SnapContext` with each open file for writing.
* sr_t: An `sr_t` is the on-disk snapshot metadata. It is part of the containing
  directory and contains sequence counters, timestamps, the list of associated
  snapshot IDs, and `past_parents`.
* snaplink_t: `past_parents` et al are stored on-disk as a `snaplink_t`, holding
  the inode number and first `snapid` of the inode/snapshot referenced (a
  simplified sketch of these structures follows this list).
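
In rough outline, with abbreviated field names (an illustrative sketch; the
real definitions live in the MDS sources and carry more state)::

   // Simplified sketch of the on-disk snapshot metadata described above;
   // field names are abbreviated and types reduced for illustration.
   #include <cstdint>
   #include <map>
   #include <string>

   typedef uint64_t snapid_t;
   typedef uint64_t inodeno_t;

   struct SnapInfo {              // one snapshot known to a realm
     snapid_t    snapid;
     std::string name;
   };

   struct snaplink_t {            // on-disk link to another realm
     inodeno_t ino;               // inode number of the referenced inode
     snapid_t  first;             // first snapid of the referenced snapshot
   };

   struct sr_t {                  // per-directory snapshot metadata
     snapid_t seq;                // sequence counter
     snapid_t created;            // when the realm was created
     std::map<snapid_t, SnapInfo>   snaps;         // associated snapshot IDs
     std::map<snapid_t, snaplink_t> past_parents;  // former parent realms
   };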

Creating a snapshot
-------------------
To make a snapshot of directory "/1/2/3/foo", the client invokes "mkdir"
inside the (hidden, special) "/1/2/3/foo/.snap" directory. This is transmitted
to the MDS server as a CEPH_MDS_OP_MKSNAP-tagged `MClientRequest`, and
initially handled in `Server::handle_client_mksnap()`. It allocates a `snapid`
from the `SnapServer`, projects a new inode with the new SnapRealm, and
commits it to the MDLog as usual. When committed, it invokes
`MDCache::do_realm_invalidate_and_update_notify()`, which triggers most of the
real work of the snapshot.

If there were already snapshots above directory "foo" (rooted at "/1", say),
the new SnapRealm adds its most immediate ancestor as a `past_parent` on
creation. After committing to the MDLog, all clients with caps on files in
"/1/2/3/foo/" are notified of the new SnapRealm (via `MDCache::send_snaps()`)
and update the `SnapContext` they are using with that data. Note that this
*is not* a synchronous part of the snapshot creation!

Updating a snapshot
-------------------
If you delete a snapshot, or move data out of the parent snapshot's hierarchy,
a similar process is followed. Extra code paths check to see if we can break
the `past_parent` links between SnapRealms, or eliminate them entirely.

Generating a SnapContext
------------------------
A RADOS `SnapContext` consists of a snapshot sequence ID (`snapid`) and all
the snapshot IDs that an object is already part of. To generate that list, we
combine the `snapids` associated with the SnapRealm with those of all its
`past_parents`.
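
In outline (an illustrative sketch with toy types, not the actual SnapRealm
code)::

   // Sketch: build a SnapContext by collecting the snapids of a realm
   // and, recursively, of all its past_parents. Toy types throughout.
   #include <algorithm>
   #include <cstdint>
   #include <vector>

   typedef uint64_t snapid_t;

   struct Realm {
     std::vector<snapid_t> snaps;            // snapshots taken in this realm
     std::vector<const Realm*> past_parents; // realms we were once under
   };

   struct SnapContext {
     snapid_t seq = 0;                       // highest snapid in the set
     std::vector<snapid_t> snaps;            // all snapids, newest first
   };

   static void collect(const Realm& r, std::vector<snapid_t>& out) {
     out.insert(out.end(), r.snaps.begin(), r.snaps.end());
     for (const Realm* p : r.past_parents)
       collect(*p, out);                     // recurse through past parents
   }

   SnapContext make_snap_context(const Realm& realm) {
     SnapContext sc;
     collect(realm, sc.snaps);
     std::sort(sc.snaps.rbegin(), sc.snaps.rend());  // descending order
     sc.snaps.erase(std::unique(sc.snaps.begin(), sc.snaps.end()),
                    sc.snaps.end());                 // drop duplicates
     if (!sc.snaps.empty())
       sc.seq = sc.snaps.front();
     return sc;
   }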

Storing snapshot data
---------------------
File data is stored in RADOS "self-managed" snapshots. Clients are careful to
use the correct `SnapContext` when writing file data to the OSDs.
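
The underlying librados interface looks roughly like this (a sketch assuming
the C++ bindings; the I/O context and object name are placeholders, and error
handling is omitted)::

   // Sketch: RADOS self-managed snapshots. CephFS clients do the
   // equivalent internally, deriving seq/snaps from the SnapContext.
   #include <rados/librados.hpp>
   #include <vector>

   void snap_then_write(librados::IoCtx& ioctx) {
     uint64_t snapid;
     ioctx.selfmanaged_snap_create(&snapid);  // allocate a new snapid

     // Tag subsequent writes with the snapshot context so the OSDs
     // preserve pre-snapshot object contents as clones.
     std::vector<uint64_t> snaps = {snapid};  // newest first
     ioctx.selfmanaged_snap_set_write_ctx(snapid, snaps);

     librados::bufferlist bl;
     bl.append("post-snapshot data");
     ioctx.write_full("some_object", bl);     // placeholder object name
   }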

Storing snapshot metadata
-------------------------
Snapshotted dentries (and their inodes) are stored in-line as part of the
directory they were in at the time of the snapshot. *All dentries* include a
`first` and `last` snapid for which they are valid. (Non-snapshotted dentries
will have their `last` set to `CEPH_NOSNAP`.)
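
Conceptually, a dentry version is visible to a snapshot only if the snapid
falls inside its validity interval (an illustrative sketch; only the sentinel
value mirrors the real constant)::

   // Sketch: [first, last] validity interval on a dentry. CEPH_NOSNAP
   // marks a live (head) dentry that has no upper bound yet.
   #include <cstdint>

   typedef uint64_t snapid_t;
   const snapid_t CEPH_NOSNAP = (snapid_t)-2;  // Ceph's "no snapshot" sentinel

   struct dentry_span {
     snapid_t first;   // first snapid this dentry version is valid for
     snapid_t last;    // last snapid, or CEPH_NOSNAP while still live

     bool visible_at(snapid_t snap) const {
       return first <= snap && snap <= last;
     }
   };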

Snapshot writeback
------------------
There is a great deal of code to handle writeback efficiently. When a Client
receives an `MClientSnap` message, it updates the local `SnapRealm`
representation and its links to specific `Inodes`, and generates a `CapSnap`
for the `Inode`. The `CapSnap` is flushed out as part of capability writeback,
and if there is dirty data the `CapSnap` is used to block fresh data writes
until the snapshot is completely flushed to the OSDs.

In the MDS, we generate snapshot-representing dentries as part of the regular
process for flushing them. Dentries with outstanding `CapSnap` data are kept
pinned and in the journal.

Deleting snapshots
------------------
Snapshots are deleted by invoking "rmdir" on the ".snap" directory they are
rooted in. (Attempts to delete a directory which roots snapshots *will fail*;
you must delete the snapshots first.) Once deleted, they are entered into the
`OSDMap` list of deleted snapshots and the file data is removed by the OSDs.
Metadata is cleaned up as the directory objects are read in and written back
out again.
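
From a client, deletion mirrors creation (a sketch; the path and snapshot
name are illustrative)::

   // Sketch: removing a snapshot is an rmdir on its .snap entry.
   // File data is then reclaimed asynchronously by the OSDs.
   #include <unistd.h>
   #include <cstdio>

   int main() {
     if (rmdir("/mnt/cephfs/some/dir/.snap/mysnap") != 0)
       perror("rmdir");
     return 0;
   }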

Hard links
----------
Hard links do not interact well with snapshots. A file is snapshotted when its
primary link is part of a SnapRealm; other links *will not* preserve data.
Generally the location where a file was first created will be its primary link,
but if the original link has been deleted it is not easy (nor always
deterministic) to find which link is now the primary.

Multi-FS
--------
Snapshots and multiple filesystems don't interact well. Specifically, each
MDS cluster allocates `snapids` independently; if you have multiple filesystems
sharing a single pool (via namespaces), their snapshots *will* collide and
deleting one will result in missing file data for others. (This may even be
invisible, not throwing errors to the user.) If each FS gets its own
pool things probably work, but this isn't tested and may not be true.