CephFS Snapshots
================

CephFS supports snapshots, generally created by invoking mkdir within the
``.snap`` directory. Note this is a hidden, special directory, not visible
during a directory listing.

Overview
--------

Generally, snapshots do what they sound like: they create an immutable view
of the file system at the point in time they're taken. There are some headline
features that make CephFS snapshots different from what you might expect:

* Arbitrary subtrees. Snapshots are created within any directory you choose,
  and cover all data in the file system under that directory.
* Asynchronous. If you create a snapshot, buffered data is flushed out lazily,
  including from other clients. As a result, "creating" the snapshot is
  very fast.

Important Data Structures
-------------------------
* SnapRealm: A `SnapRealm` is created whenever you create a snapshot at a new
  point in the hierarchy (or when a snapshotted inode is moved outside of its
  parent snapshot). SnapRealms contain an `sr_t srnode` and `inodes_with_caps`
  that are part of the snapshot. Clients also have a SnapRealm concept that
  maintains less data but is used to associate a `SnapContext` with each open
  file for writing.
* sr_t: An `sr_t` is the on-disk snapshot metadata. It is part of the
  containing directory and contains sequence counters, timestamps, the list of
  associated snapshot IDs, and `past_parent_snaps`.
* SnapServer: The SnapServer manages snapshot ID allocation and deletion, and
  tracks the list of effective snapshots in the file system. A file system has
  only one snapserver instance.
* SnapClient: The SnapClient is used to communicate with the snapserver; each
  MDS rank has its own snapclient instance. The SnapClient also caches
  effective snapshots locally. (A simplified sketch of these structures
  follows this list.)
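
As a rough mental model, the relationship between these structures might look
like the following simplified sketch. The field names here are illustrative
approximations, not the exact definitions in the Ceph sources.

.. code:: c++

    // Simplified, assumed field names; the real definitions carry more state.
    #include <cstdint>
    #include <ctime>
    #include <set>

    using snapid_t = uint64_t;

    struct CInode;  // stand-in for the MDS in-memory inode type

    // On-disk snapshot metadata, stored as part of the containing directory.
    struct sr_t {
      snapid_t seq = 0;                      // sequence counter
      time_t last_modified = 0;              // timestamp of the last change
      std::set<snapid_t> snaps;              // snapshot IDs rooted here
      std::set<snapid_t> past_parent_snaps;  // snaps inherited from old parents
    };

    // In-memory realm created at each snapshotted point in the hierarchy.
    struct SnapRealm {
      sr_t srnode;                           // the on-disk payload above
      std::set<CInode*> inodes_with_caps;    // inodes covered by this realm
    };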

Creating a snapshot
-------------------
The CephFS snapshot feature is enabled by default on new file systems. To
enable it on an existing file system, use the command below.

.. code::

    $ ceph fs set <fs_name> allow_new_snaps true

When snapshots are enabled, all directories in CephFS will have a special
``.snap`` directory. (You may configure a different name with the ``client
snapdir`` setting if you wish.)

To create a CephFS snapshot, create a subdirectory under
``.snap`` with a name of your choice. For example, to create a snapshot on
directory "/1/2/3/", invoke ``mkdir /1/2/3/.snap/my-snapshot-name``.

This is transmitted to the MDS server as a
CEPH_MDS_OP_MKSNAP-tagged `MClientRequest`, and initially handled in
Server::handle_client_mksnap(). It allocates a `snapid` from the `SnapServer`,
projects a new inode with the new SnapRealm, and commits it to the MDLog as
usual. When committed, it invokes
`MDCache::do_realm_invalidate_and_update_notify()`, which notifies all clients
with caps on files under "/1/2/3/" about the new SnapRealm. When clients get
the notifications, they update the client-side SnapRealm hierarchy, link files
under "/1/2/3/" to the new SnapRealm, and generate a `SnapContext` for the
new SnapRealm.

Note that this *is not* a synchronous part of the snapshot creation!
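
For orientation, that flow can be condensed into the following pseudocode
outline (an assumed simplification; the real logic is spread across `Server`
and `MDCache`):

.. code:: c++

    // Pseudocode outline of snapshot creation on the MDS; illustrative only.
    void mksnap_flow_outline() {
      // 1. Receive the CEPH_MDS_OP_MKSNAP-tagged MClientRequest.
      // 2. Allocate a fresh snapid from the SnapServer.
      // 3. Project the directory inode with its new SnapRealm.
      // 4. Journal the update in the MDLog and wait for the commit.
      // 5. On commit, MDCache::do_realm_invalidate_and_update_notify()
      //    notifies every client holding caps under the directory; the
      //    clients then relink inodes and build the new SnapContext
      //    asynchronously.
    }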

Updating a snapshot
-------------------
If you delete a snapshot, a similar process is followed. If you remove an inode
out of its parent SnapRealm, the rename code creates a new SnapRealm for the
renamed inode (if a SnapRealm does not already exist), saves the IDs of
snapshots that are effective on the original parent SnapRealm into
`past_parent_snaps` of the new SnapRealm, and then follows a process similar to
creating a snapshot.

Generating a SnapContext
------------------------
A RADOS `SnapContext` consists of a snapshot sequence ID (`snapid`) and all
the snapshot IDs that an object is already part of. To generate that list, we
combine the `snapids` associated with the SnapRealm and all valid `snapids` in
`past_parent_snaps`. Stale `snapids` are filtered out using the SnapClient's
cached effective snapshots.
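
A minimal sketch of that combination step, building on the simplified
`SnapRealm` sketched earlier and using `effective_snaps` as a stand-in for the
SnapClient's cached set of live snapshots:

.. code:: c++

    #include <algorithm>
    #include <set>
    #include <vector>

    // Merge a realm's own snapshot IDs with the still-valid IDs inherited
    // through past_parent_snaps; stale (deleted) IDs are dropped.
    std::vector<snapid_t> build_snap_list(
        const SnapRealm& realm,
        const std::set<snapid_t>& effective_snaps) {
      std::vector<snapid_t> out(realm.srnode.snaps.begin(),
                                realm.srnode.snaps.end());
      for (snapid_t s : realm.srnode.past_parent_snaps)
        if (effective_snaps.count(s))  // filter out stale snapids
          out.push_back(s);
      // RADOS expects the snapids in a SnapContext in descending order.
      std::sort(out.rbegin(), out.rend());
      return out;
    }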

Storing snapshot data
---------------------
File data is stored in RADOS "self-managed" snapshots. Clients are careful to
use the correct `SnapContext` when writing file data to the OSDs.

Storing snapshot metadata
-------------------------
Snapshotted dentries (and their inodes) are stored in-line as part of the
directory they were in at the time of the snapshot. *All dentries* include a
`first` and `last` snapid for which they are valid. (Non-snapshotted dentries
will have their `last` set to CEPH_NOSNAP.)
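
The validity range reads as a closed interval. A small illustrative check
(simplified types; `CEPH_NOSNAP` is a reserved marker defined in the Ceph
headers):

.. code:: c++

    #include <cstdint>

    using snapid_t = uint64_t;

    // Reserved marker for the live (head) version; CEPH_NOSNAP in Ceph.
    constexpr snapid_t NOSNAP = static_cast<snapid_t>(-2);

    struct DentryRange {
      snapid_t first;  // first snapid this dentry version is valid for
      snapid_t last;   // last snapid, or NOSNAP for the live version
    };

    // A dentry version is visible in a snapshot when the snapid falls
    // inside its [first, last] range; live dentries match every snapid
    // at or after `first` because last == NOSNAP is effectively maximal.
    bool visible_in(const DentryRange& d, snapid_t snap) {
      return d.first <= snap && snap <= d.last;
    }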

Snapshot writeback
------------------
There is a great deal of code to handle writeback efficiently. When a Client
receives an `MClientSnap` message, it updates the local `SnapRealm`
representation and its links to specific `Inodes`, and generates a `CapSnap`
for the `Inode`. The `CapSnap` is flushed out as part of capability writeback,
and if there is dirty data the `CapSnap` is used to block fresh data writes
until the snapshot is completely flushed to the OSDs.

In the MDS, we generate snapshot-representing dentries as part of the regular
process for flushing them. Dentries with outstanding `CapSnap` data are kept
pinned and in the journal.
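
The write-blocking rule can be stated compactly; the sketch below uses
hypothetical stand-in names, not the actual Client members:

.. code:: c++

    // Illustrative stand-in for the client-side CapSnap bookkeeping.
    struct CapSnapState {
      bool dirty_data;  // file had unflushed data when the snapshot arrived
      bool flushing;    // snapshot data is still being written to the OSDs
    };

    // Fresh writes are held off while a dirty CapSnap is still in flight,
    // so data written after the snapshot cannot leak into the snapshot.
    bool may_write_fresh_data(const CapSnapState* snap) {
      return snap == nullptr || !snap->dirty_data || !snap->flushing;
    }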

Deleting snapshots
------------------
Snapshots are deleted by invoking "rmdir" on the ".snap" directory they are
rooted in. (Attempts to delete a directory which roots snapshots *will fail*;
you must delete the snapshots first.) Once deleted, they are entered into the
`OSDMap` list of deleted snapshots and the file data is removed by the OSDs.
Metadata is cleaned up as the directory objects are read in and written back
out again.

Hard links
----------
An inode with multiple hard links is moved to a dummy global SnapRealm. The
dummy SnapRealm covers all snapshots in the file system. The inode's data
will be preserved for any new snapshot. The preserved data covers snapshots
on any linkage of the inode.

Multi-FS
--------
Snapshots and multiple file systems don't interact well. Specifically, each
MDS cluster allocates `snapids` independently; if you have multiple file
systems sharing a single pool (via namespaces), their snapshots *will* collide
and deleting one will result in missing file data for others. (This may even be
invisible, not throwing errors to the user.) If each FS gets its own pool,
things probably work, but this isn't tested and may not be true.