CephFS Snapshots
================

CephFS supports snapshots, generally created by invoking mkdir within the
``.snap`` directory. Note this is a hidden, special directory, not visible
during a directory listing.

Overview
--------

Generally, snapshots do what they sound like: they create an immutable view
of the file system at the point in time they're taken. There are some headline
features that make CephFS snapshots different from what you might expect:

* Arbitrary subtrees. Snapshots are created within any directory you choose,
  and cover all data in the file system under that directory.
* Asynchronous. If you create a snapshot, buffered data is flushed out lazily,
  including from other clients. As a result, "creating" the snapshot is
  very fast.

Important Data Structures
-------------------------
* SnapRealm: A `SnapRealm` is created whenever you create a snapshot at a new
  point in the hierarchy (or when a snapshotted inode is moved outside of its
  parent snapshot). SnapRealms contain an `sr_t srnode` and the
  `inodes_with_caps` that are part of the snapshot. Clients also have a
  SnapRealm concept that maintains less data but is used to associate a
  `SnapContext` with each open file for writing.
* sr_t: An `sr_t` is the on-disk snapshot metadata. It is part of the containing
  directory and contains sequence counters, timestamps, the list of associated
  snapshot IDs, and `past_parent_snaps`.
* SnapServer: The SnapServer manages snapshot ID allocation and snapshot
  deletion, and tracks the list of effective snapshots in the file system. A
  file system has only one SnapServer instance.
* SnapClient: A SnapClient is used to communicate with the SnapServer; each MDS
  rank has its own SnapClient instance. The SnapClient also caches the effective
  snapshots locally.

Creating a snapshot
-------------------
The CephFS snapshot feature is enabled by default on new file systems. To enable
it on existing file systems, use the command below.

.. code::

    $ ceph fs set <fs_name> allow_new_snaps true

When snapshots are enabled, all directories in CephFS will have a special
``.snap`` directory. (You may configure a different name with the ``client
snapdir`` setting if you wish.)
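
For example, here is a hedged sketch of renaming the snapshot directory to
``.snapshots``. It assumes a ceph-fuse/libcephfs client that reads the
``client_snapdir`` option from the cluster configuration database, and a kernel
client that accepts the ``snapdirname`` mount option; check your client's
documentation for the exact spelling on your version:

.. code::

    # ceph-fuse / libcephfs clients: override the default ".snap" name
    $ ceph config set client client_snapdir .snapshots

    # kernel clients: set the name at mount time instead
    $ mount -t ceph <mon_addr>:/ /mnt/cephfs -o name=admin,snapdirname=.snapshots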

To create a CephFS snapshot, create a subdirectory under
``.snap`` with a name of your choice. For example, to create a snapshot on
directory "/1/2/3/", invoke ``mkdir /1/2/3/.snap/my-snapshot-name``.
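
As a concrete illustration (assuming the file system is mounted at
``/mnt/cephfs``; the mount point and paths are only examples), the snapshot is
taken, listed, and browsed entirely through the ``.snap`` directory:

.. code::

    # take a snapshot of everything under /1/2/3
    $ mkdir /mnt/cephfs/1/2/3/.snap/my-snapshot-name

    # list the snapshots rooted at this directory
    $ ls /mnt/cephfs/1/2/3/.snap
    my-snapshot-name

    # read-only view of the directory as it was at snapshot time
    $ ls /mnt/cephfs/1/2/3/.snap/my-snapshot-name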

.. note::
   Snapshot names cannot start with an underscore ('_'), as these names are
   reserved for internal usage.

.. note::
   Snapshot names cannot exceed 240 characters. This is because the MDS makes
   use of long snapshot names internally, which follow the format:
   `_<SNAPSHOT-NAME>_<INODE-NUMBER>`. Since filenames in general can't have
   more than 255 characters, and the `<INODE-NUMBER>` takes 13 characters, the
   user-visible snapshot name can be at most 255 - 1 - 1 - 13 = 240 characters.

This is transmitted to the MDS Server as a
CEPH_MDS_OP_MKSNAP-tagged `MClientRequest`, and initially handled in
Server::handle_client_mksnap(). It allocates a `snapid` from the `SnapServer`,
projects a new inode with the new SnapRealm, and commits it to the MDLog as
usual. When committed, it invokes
`MDCache::do_realm_invalidate_and_update_notify()`, which notifies all clients
with caps on files under "/1/2/3/" about the new SnapRealm. When clients get
the notifications, they update the client-side SnapRealm hierarchy, link files
under "/1/2/3/" to the new SnapRealm, and generate a `SnapContext` for the
new SnapRealm.

Note that this *is not* a synchronous part of the snapshot creation!

Updating a snapshot
-------------------
If you delete a snapshot, a similar process is followed. If you move an inode
out of its parent SnapRealm, the rename code creates a new SnapRealm for the
renamed inode (if a SnapRealm does not already exist), saves the IDs of the
snapshots that are effective on the original parent SnapRealm into
`past_parent_snaps` of the new SnapRealm, and then follows a process similar to
creating a snapshot.

Generating a SnapContext
------------------------
A RADOS `SnapContext` consists of a snapshot sequence ID (`snapid`) and all
the snapshot IDs that an object is already part of. To generate that list, we
combine the `snapids` associated with the SnapRealm with all valid `snapids` in
its `past_parent_snaps`. Stale `snapids` are filtered out by checking them
against the SnapClient's cached effective snapshots.

Storing snapshot data
---------------------
File data is stored in RADOS "self-managed" snapshots. Clients are careful to
use the correct `SnapContext` when writing file data to the OSDs.
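
As a hedged illustration of what this looks like from the RADOS side (the pool
and object names below are placeholders; CephFS data objects are typically
named ``<inode-number-in-hex>.<block-number-in-hex>``), you can ask the data
pool which self-managed snapshots an object participates in:

.. code::

    # list the clones/snapshots RADOS keeps for one file data object
    $ rados -p <data_pool> listsnaps <inode-hex>.00000000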

Storing snapshot metadata
-------------------------
Snapshotted dentries (and their inodes) are stored in-line as part of the
directory they were in at the time of the snapshot. *All dentries* include a
`first` and `last` snapid for which they are valid. (Non-snapshotted dentries
will have their `last` set to CEPH_NOSNAP.)
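
For a rough, hedged way to see this in-line storage (the pool and object names
below are placeholders; directory contents live as omap entries of objects
named after the directory's inode number and fragment in the metadata pool):

.. code::

    # dump the dentry keys stored with a directory object; snapshotted
    # dentries are kept alongside the head dentries in the same object
    $ rados -p <metadata_pool> listomapkeys <dir-inode-hex>.00000000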

Snapshot writeback
------------------
There is a great deal of code to handle writeback efficiently. When a Client
receives an `MClientSnap` message, it updates the local `SnapRealm`
representation and its links to specific `Inodes`, and generates a `CapSnap`
for the `Inode`. The `CapSnap` is flushed out as part of capability writeback,
and if there is dirty data the `CapSnap` is used to block fresh data writes
until the snapshot is completely flushed to the OSDs.

In the MDS, we generate snapshot-representing dentries as part of the regular
process for flushing them. Dentries with outstanding `CapSnap` data are kept
pinned and in the journal.

Deleting snapshots
------------------
A snapshot is deleted by invoking ``rmdir`` on its subdirectory within the
``.snap`` directory it is rooted in. (Attempts to delete a directory which
roots snapshots *will fail*; you must delete the snapshots first.) Once
deleted, snapshots are entered into the `OSDMap` list of deleted snapshots and
the file data is removed by the OSDs. Metadata is cleaned up as the directory
objects are read in and written back out again.
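
Continuing the earlier example (the ``/mnt/cephfs`` mount point is only an
assumption for illustration), deleting the snapshot is a plain ``rmdir``:

.. code::

    # remove the snapshot taken on /1/2/3 earlier
    $ rmdir /mnt/cephfs/1/2/3/.snap/my-snapshot-name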

Hard links
----------
An inode with multiple hard links is moved to a dummy global SnapRealm. The
dummy SnapRealm covers all snapshots in the file system. The inode's data
will be preserved for any new snapshot. The preserved data will cover
snapshots on any link of the inode.

Multi-FS
--------
Snapshots and multiple file systems don't interact well. Specifically, each
MDS cluster allocates `snapids` independently; if you have multiple file systems
sharing a single pool (via namespaces), their snapshots *will* collide and
deleting one will result in missing file data for others. (This may even be
invisible, not throwing errors to the user.) If each FS gets its own
pool, things probably work, but this isn't tested and may not be true.

.. note:: To avoid snapid collisions between mon-managed snapshots and file
   system snapshots, pools with mon-managed snapshots are not allowed to be
   attached to a file system. Also, mon-managed snapshots can't be created in
   pools already attached to a file system.
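
A hedged sketch of the two operations this restriction affects (the pool and
file system names are placeholders, and the exact error messages depend on the
release; both commands are expected to be refused):

.. code::

    # creating a mon-managed (pool) snapshot on a pool already attached
    # to a CephFS file system is expected to be rejected
    $ ceph osd pool mksnap <fs_data_pool> <pool_snap_name>

    # attaching a pool that already has mon-managed snapshots to a file
    # system is likewise expected to be rejected
    $ ceph fs add_data_pool <fs_name> <pool_with_pool_snaps>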