CephFS Snapshots
================

CephFS supports snapshots, generally created by invoking mkdir against the
(hidden, special) .snap directory.

Overview
-----------

Generally, snapshots do what they sound like: they create an immutable view
of the filesystem at the point in time they're taken. There are some headline
features that make CephFS snapshots different from what you might expect:

* Arbitrary subtrees. Snapshots are created within any directory you choose,
  and cover all data in the filesystem under that directory.
* Asynchronous. If you create a snapshot, buffered data is flushed out lazily,
  including from other clients. As a result, "creating" the snapshot is
  very fast.

Important Data Structures
-------------------------
* SnapRealm: A `SnapRealm` is created whenever you create a snapshot at a new
  point in the hierarchy (or, when a snapshotted inode is moved outside of its
  parent snapshot). SnapRealms contain an `sr_t srnode`, links to `past_parents`
  and `past_children`, and all `inodes_with_caps` that are part of the snapshot.
  Clients also have a SnapRealm concept that maintains less data but is used to
  associate a `SnapContext` with each open file for writing.
* sr_t: An `sr_t` is the on-disk snapshot metadata. It is part of the containing
  directory and contains sequence counters, timestamps, the list of associated
  snapshot IDs, and `past_parents`.
* snaplink_t: `past_parents` et al are stored on-disk as a `snaplink_t`, holding
  the inode number and first `snapid` of the inode/snapshot referenced.

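The relationships are easier to see as code. Below is a simplified,
illustrative sketch of these structures (the names roughly mirror
`src/mds/snap.h` and `src/mds/SnapRealm.h`, but the fields are abridged and
the types are stand-ins rather than the real definitions):

.. code-block:: cpp

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>

    using snapid_t  = uint64_t;   // snapshot IDs allocated by the SnapServer
    using inodeno_t = uint64_t;   // inode numbers

    // On-disk link to another inode's snapshot history (simplified).
    struct snaplink_t {
      inodeno_t ino;    // inode rooting the referenced SnapRealm
      snapid_t  first;  // first snapid for which that realm applies here
    };

    // On-disk snapshot metadata, stored with the directory inode (simplified).
    struct sr_t {
      snapid_t seq = 0;                             // last snapid issued in this realm
      snapid_t created = 0;                         // snapid at which the realm was created
      std::map<snapid_t, std::string> snaps;        // snapid -> snapshot name
      std::map<snapid_t, snaplink_t> past_parents;  // former ancestors, by last covered snapid
    };

    // In-memory realm; the real MDS class also tracks open children,
    // cached snap contexts, client notifications, and so on.
    struct SnapRealm {
      sr_t srnode;                           // the on-disk portion above
      SnapRealm *parent = nullptr;           // current parent realm
      std::set<inodeno_t> inodes_with_caps;  // inodes whose caps live in this realm
    };
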
Creating a snapshot
-------------------
To make a snapshot of directory "/1/2/3/foo", the client invokes "mkdir"
inside the "/1/2/3/foo/.snap" directory. This is transmitted to the MDS server
as a CEPH_MDS_OP_MKSNAP-tagged `MClientRequest`, and initially handled in
Server::handle_client_mksnap(). It allocates a `snapid` from the `SnapServer`,
projects a new inode with the new SnapRealm, and commits it to the MDLog as
usual. When committed, it invokes
`MDCache::do_realm_invalidate_and_update_notify()`, which triggers most of the
real work of the snapshot.

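From the client's perspective this is an ordinary directory create. A minimal
libcephfs sketch of the same operation (assuming a mounted, reachable cluster
with the default `ceph.conf`, the default snapshot directory name `.snap`, and
a hypothetical snapshot name; error handling abbreviated):

.. code-block:: cpp

    #include <cephfs/libcephfs.h>
    #include <cstdio>

    int main() {
      struct ceph_mount_info *cmount = nullptr;
      if (ceph_create(&cmount, nullptr) != 0)   // default client id
        return 1;
      ceph_conf_read_file(cmount, nullptr);     // read the default ceph.conf
      if (ceph_mount(cmount, "/") != 0)         // mount at the filesystem root
        return 1;

      // Creating an entry under .snap is what becomes CEPH_MDS_OP_MKSNAP.
      int r = ceph_mkdir(cmount, "/1/2/3/foo/.snap/my_snapshot", 0755);
      printf("mksnap returned %d\n", r);

      ceph_unmount(cmount);
      ceph_release(cmount);
      return r == 0 ? 0 : 1;
    }
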
If there were already snapshots above directory "foo" (rooted at "/1", say),
the new SnapRealm adds its most immediate ancestor as a `past_parent` on
creation. After committing to the MDLog, all clients with caps on files in
"/1/2/3/foo/" are notified (MDCache::send_snaps()) of the new SnapRealm, and
update the `SnapContext` they are using with that data. Note that this
*is not* a synchronous part of the snapshot creation!

Updating a snapshot
-------------------
If you delete a snapshot, or move data out of the parent snapshot's hierarchy,
a similar process is followed. Extra code paths check to see if we can break
the `past_parent` links between SnapRealms, or eliminate them entirely.

Generating a SnapContext
------------------------
A RADOS `SnapContext` consists of a snapshot sequence ID (`snapid`) and all
the snapshot IDs that an object is already part of. To generate that list, we
collect all `snapids` associated with the SnapRealm and all of its
`past_parents`.

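A hedged sketch of that computation, reusing the simplified realm structure
from the data-structures section (the real logic lives in the MDS's
`SnapRealm::build_snap_context()`; RADOS requires the snap list to be sorted
in descending order):

.. code-block:: cpp

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using snapid_t = uint64_t;

    // Minimal stand-ins, not the real MDS types.
    struct SnapRealm {
      snapid_t seq = 0;                        // current snap sequence for this realm
      std::vector<snapid_t> my_snaps;          // snapids created directly in this realm
      std::vector<SnapRealm *> past_parents;   // realms this one used to live under
      SnapRealm *parent = nullptr;             // current parent realm
    };

    // What a RADOS SnapContext boils down to.
    struct SnapContext {
      snapid_t seq;                 // most recent applicable snapid
      std::vector<snapid_t> snaps;  // every applicable snapid, descending
    };

    // Gather this realm's snapids plus those of its ancestors and past parents.
    // (The real code also clamps past-parent snapids to the interval during
    // which that parent actually covered this realm.)
    static void gather(const SnapRealm &r, std::vector<snapid_t> &out) {
      out.insert(out.end(), r.my_snaps.begin(), r.my_snaps.end());
      for (const SnapRealm *pp : r.past_parents)
        gather(*pp, out);
      if (r.parent)
        gather(*r.parent, out);
    }

    SnapContext build_snap_context(const SnapRealm &realm) {
      SnapContext sc;
      gather(realm, sc.snaps);
      std::sort(sc.snaps.rbegin(), sc.snaps.rend());  // descending order
      sc.seq = sc.snaps.empty() ? realm.seq
                                : std::max(realm.seq, sc.snaps.front());
      return sc;
    }
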
Storing snapshot data
---------------------
File data is stored in RADOS "self-managed" snapshots. Clients are careful to
use the correct `SnapContext` when writing file data to the OSDs.

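As an illustration of what "self-managed" means at the RADOS level, here is
the librados C API pattern such a client follows (a sketch against a
hypothetical scratch pool; CephFS does the equivalent internally, using the
`SnapContext` built as described above):

.. code-block:: cpp

    #include <rados/librados.h>

    int main() {
      rados_t cluster;
      rados_ioctx_t io;
      if (rados_create(&cluster, nullptr) < 0) return 1;  // default client
      rados_conf_read_file(cluster, nullptr);             // default ceph.conf
      if (rados_connect(cluster) < 0) return 1;
      if (rados_ioctx_create(cluster, "scratch-pool", &io) < 0)  // hypothetical pool
        return 1;

      // "Self-managed" snapshots: the caller allocates the snapid and is
      // responsible for remembering it (in CephFS, the SnapServer does this).
      rados_snap_t snap;
      rados_ioctx_selfmanaged_snap_create(io, &snap);

      // Writes carry a SnapContext naming the snapshot, so the OSD preserves
      // a clone of the pre-snapshot object contents before applying the write.
      rados_snap_t snaps[1] = { snap };
      rados_ioctx_selfmanaged_snap_set_write_ctx(io, snap, snaps, 1);
      rados_write(io, "some_object", "post-snapshot data", 18, 0);

      rados_ioctx_destroy(io);
      rados_shutdown(cluster);
      return 0;
    }
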
Storing snapshot metadata
-------------------------
Snapshotted dentries (and their inodes) are stored in-line as part of the
directory they were in at the time of the snapshot. *All dentries* include a
`first` and `last` snapid for which they are valid. (Non-snapshotted dentries
will have their `last` set to CEPH_NOSNAP).

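A toy illustration of how those `first`/`last` ranges resolve a lookup at a
given snapid (hypothetical code, not the MDS implementation; the CEPH_NOSNAP
value matches the Ceph headers' "head"/live revision):

.. code-block:: cpp

    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    using snapid_t = uint64_t;
    constexpr snapid_t CEPH_NOSNAP = static_cast<snapid_t>(-2);  // live ("head") version

    struct Dentry {
      std::string name;
      snapid_t first, last;  // inclusive snapid range this dentry is valid for
      std::string target;    // stand-in for the linked inode
    };

    // Return the version of `name` visible at snapshot `snap`
    // (pass CEPH_NOSNAP to ask for the current, non-snapshot view).
    std::optional<Dentry> lookup(const std::vector<Dentry> &dir,
                                 const std::string &name, snapid_t snap) {
      for (const auto &d : dir)
        if (d.name == name && d.first <= snap && snap <= d.last)
          return d;
      return std::nullopt;
    }

    // Example directory contents for a file rewritten after snapid 2:
    //   { "file", 1, 2,           "old contents" }  <- covers snapshots 1..2
    //   { "file", 3, CEPH_NOSNAP, "new contents" }  <- the live dentry
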
Snapshot writeback
------------------
There is a great deal of code to handle writeback efficiently. When a Client
receives an `MClientSnap` message, it updates the local `SnapRealm`
representation and its links to specific `Inodes`, and generates a `CapSnap`
for the `Inode`. The `CapSnap` is flushed out as part of capability writeback,
and if there is dirty data the `CapSnap` is used to block fresh data writes
until the snapshot is completely flushed to the OSDs.

In the MDS, we generate snapshot-representing dentries as part of the regular
process for flushing them. Dentries with outstanding `CapSnap` data are kept
pinned and in the journal.

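A heavily simplified sketch of the client-side idea (hypothetical types, not
the actual `Client`/`CapSnap` code): when notified of a snapshot, freeze the
inode's dirty state into a per-snapshot record, and make new writers wait
until that record has been flushed.

.. code-block:: cpp

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <vector>

    using snapid_t = uint64_t;

    struct CapSnap {
      snapid_t follows;     // the snapshot this frozen state belongs to
      bool     dirty_data;  // buffered file data still has to reach the OSDs
      uint64_t size;        // metadata captured at snapshot time
    };

    struct Inode {
      std::mutex lk;
      std::condition_variable flushed;  // signalled by a (not shown) flusher thread
      std::vector<CapSnap> cap_snaps;   // pending per-snapshot writeback state
      bool has_dirty_buffers = false;
      uint64_t size = 0;
    };

    // Called when an MClientSnap tells us a new snapshot covers this inode.
    void queue_cap_snap(Inode &in, snapid_t snap_seq) {
      std::lock_guard<std::mutex> l(in.lk);
      in.cap_snaps.push_back({snap_seq, in.has_dirty_buffers, in.size});
    }

    // New writes wait until every CapSnap with dirty data has been flushed,
    // so post-snapshot writes cannot leak into the pre-snapshot clones.
    void wait_for_snap_flush(Inode &in) {
      std::unique_lock<std::mutex> l(in.lk);
      in.flushed.wait(l, [&] {
        for (const auto &cs : in.cap_snaps)
          if (cs.dirty_data) return false;
        return true;
      });
    }
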
Deleting snapshots
------------------
Snapshots are deleted by invoking "rmdir" on the ".snap" directory they are
rooted in. (Attempts to delete a directory which roots snapshots *will fail*;
you must delete the snapshots first.) Once deleted, they are entered into the
`OSDMap` list of deleted snapshots and the file data is removed by the OSDs.
Metadata is cleaned up as the directory objects are read in and written back
out again.

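The matching client-side call, under the same assumptions as the `mkdir`
sketch above (hypothetical snapshot name; this becomes a CEPH_MDS_OP_RMSNAP
request to the MDS):

.. code-block:: cpp

    #include <cephfs/libcephfs.h>

    int main() {
      struct ceph_mount_info *cmount = nullptr;
      ceph_create(&cmount, nullptr);
      ceph_conf_read_file(cmount, nullptr);
      if (ceph_mount(cmount, "/") != 0)
        return 1;

      // Removing the snapshot's entry under .snap deletes the snapshot.
      int r = ceph_rmdir(cmount, "/1/2/3/foo/.snap/my_snapshot");

      ceph_unmount(cmount);
      ceph_release(cmount);
      return r == 0 ? 0 : 1;
    }
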
Hard links
----------
Hard links do not interact well with snapshots. A file is snapshotted when its
primary link is part of a SnapRealm; other links *will not* preserve data.
Generally the location where a file was first created will be its primary link,
but if the original link has been deleted it is not easy (nor always
deterministic) to find which link is now the primary.

Multi-FS
---------
Snapshots and multiple filesystems don't interact well. Specifically, each
MDS cluster allocates `snapids` independently; if you have multiple filesystems
sharing a single pool (via namespaces), their snapshots *will* collide and
deleting one will result in missing file data for others. (This may even be
invisible, not throwing errors to the user.) If each FS gets its own
pool things probably work, but this isn't tested and may not be true.