CephFS Snapshots
================

CephFS supports snapshots, generally created by invoking mkdir against the
(hidden, special) .snap directory.
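
For example, from a client mount point a snapshot is created with an ordinary
mkdir(2) call. A minimal sketch, assuming a mount at "/mnt/cephfs" and an
illustrative path and snapshot name::

   // Sketch: snapshot the subtree rooted at /mnt/cephfs/some/dir.
   // The mount point, path, and snapshot name are illustrative.
   #include <sys/stat.h>
   #include <cstdio>

   int main() {
     // mkdir inside the special .snap directory creates the snapshot
     if (mkdir("/mnt/cephfs/some/dir/.snap/mysnap", 0755) != 0)
       perror("mkdir");
     return 0;
   }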

Overview
--------

Generally, snapshots do what they sound like: they create an immutable view
of the filesystem at the point in time they're taken. There are some headline
features that make CephFS snapshots different from what you might expect:

* Arbitrary subtrees. Snapshots are created within any directory you choose,
  and cover all data in the filesystem under that directory.
* Asynchronous. If you create a snapshot, buffered data is flushed out lazily,
  including from other clients. As a result, "creating" the snapshot is
  very fast.

Important Data Structures
-------------------------
* SnapRealm: A `SnapRealm` is created whenever you create a snapshot at a new
  point in the hierarchy (or, when a snapshotted inode is moved outside of its
  parent snapshot). SnapRealms contain an `sr_t srnode`, links to `past_parents`
  and `past_children`, and all `inodes_with_caps` that are part of the snapshot.
  Clients also have a SnapRealm concept that maintains less data but is used to
  associate a `SnapContext` with each open file for writing.
* sr_t: An `sr_t` is the on-disk snapshot metadata. It is part of the containing
  directory and contains sequence counters, timestamps, the list of associated
  snapshot IDs, and `past_parents`.
* snaplink_t: `past_parents` et al are stored on-disk as a `snaplink_t`, holding
  the inode number and first `snapid` of the inode/snapshot referenced (a
  simplified sketch of these structures follows this list).
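
In rough outline, with abbreviated field names (an illustrative sketch; the
real definitions live in the MDS sources and carry more state)::

   // Simplified sketch of the on-disk snapshot metadata described above;
   // field names are abbreviated and types reduced for illustration.
   #include <cstdint>
   #include <map>
   #include <string>

   typedef uint64_t snapid_t;
   typedef uint64_t inodeno_t;

   struct SnapInfo {              // one snapshot known to a realm
     snapid_t    snapid;
     std::string name;
   };

   struct snaplink_t {            // on-disk link to another realm
     inodeno_t ino;               // inode number of the referenced inode
     snapid_t  first;             // first snapid of the referenced snapshot
   };

   struct sr_t {                  // per-directory snapshot metadata
     snapid_t seq;                // sequence counter
     snapid_t created;            // when the realm was created
     std::map<snapid_t, SnapInfo>   snaps;         // associated snapshot IDs
     std::map<snapid_t, snaplink_t> past_parents;  // former parent realms
   };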

Creating a snapshot
-------------------
To make a snapshot of directory "/1/2/3/foo", the client invokes "mkdir"
inside the (hidden, special) "/1/2/3/foo/.snap" directory. This is transmitted
to the MDS server as a CEPH_MDS_OP_MKSNAP-tagged `MClientRequest`, and
initially handled in `Server::handle_client_mksnap()`. It allocates a `snapid`
from the `SnapServer`, projects a new inode with the new SnapRealm, and
commits it to the MDLog as usual. When committed, it invokes
`MDCache::do_realm_invalidate_and_update_notify()`, which triggers most of the
real work of the snapshot.

If there were already snapshots above directory "foo" (rooted at "/1", say),
the new SnapRealm adds its most immediate ancestor as a `past_parent` on
creation. After committing to the MDLog, all clients with caps on files in
"/1/2/3/foo/" are notified of the new SnapRealm (via `MDCache::send_snaps()`)
and update the `SnapContext` they are using with that data. Note that this
*is not* a synchronous part of the snapshot creation!

Updating a snapshot
-------------------
If you delete a snapshot, or move data out of the parent snapshot's hierarchy,
a similar process is followed. Extra code paths check to see if we can break
the `past_parent` links between SnapRealms, or eliminate them entirely.

Generating a SnapContext
------------------------
A RADOS `SnapContext` consists of a snapshot sequence ID (`snapid`) and all
the snapshot IDs that an object is already part of. To generate that list, we
combine the `snapids` associated with the SnapRealm with those of all its
`past_parents`.
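
In outline (an illustrative sketch with toy types, not the actual SnapRealm
code)::

   // Sketch: build a SnapContext by collecting the snapids of a realm
   // and, recursively, of all its past_parents. Toy types throughout.
   #include <algorithm>
   #include <cstdint>
   #include <vector>

   typedef uint64_t snapid_t;

   struct Realm {
     std::vector<snapid_t> snaps;            // snapshots taken in this realm
     std::vector<const Realm*> past_parents; // realms we were once under
   };

   struct SnapContext {
     snapid_t seq = 0;                       // highest snapid in the set
     std::vector<snapid_t> snaps;            // all snapids, newest first
   };

   static void collect(const Realm& r, std::vector<snapid_t>& out) {
     out.insert(out.end(), r.snaps.begin(), r.snaps.end());
     for (const Realm* p : r.past_parents)
       collect(*p, out);                     // recurse through past parents
   }

   SnapContext make_snap_context(const Realm& realm) {
     SnapContext sc;
     collect(realm, sc.snaps);
     std::sort(sc.snaps.rbegin(), sc.snaps.rend());  // descending order
     sc.snaps.erase(std::unique(sc.snaps.begin(), sc.snaps.end()),
                    sc.snaps.end());                 // drop duplicates
     if (!sc.snaps.empty())
       sc.seq = sc.snaps.front();
     return sc;
   }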

Storing snapshot data
---------------------
File data is stored in RADOS "self-managed" snapshots. Clients are careful to
use the correct `SnapContext` when writing file data to the OSDs.
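
The underlying librados interface looks roughly like this (a sketch assuming
the C++ bindings; the I/O context and object name are placeholders, and error
handling is omitted)::

   // Sketch: RADOS self-managed snapshots. CephFS clients do the
   // equivalent internally, deriving seq/snaps from the SnapContext.
   #include <rados/librados.hpp>
   #include <vector>

   void snap_then_write(librados::IoCtx& ioctx) {
     uint64_t snapid;
     ioctx.selfmanaged_snap_create(&snapid);  // allocate a new snapid

     // Tag subsequent writes with the snapshot context so the OSDs
     // preserve pre-snapshot object contents as clones.
     std::vector<uint64_t> snaps = {snapid};  // newest first
     ioctx.selfmanaged_snap_set_write_ctx(snapid, snaps);

     librados::bufferlist bl;
     bl.append("post-snapshot data");
     ioctx.write_full("some_object", bl);     // placeholder object name
   }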

Storing snapshot metadata
-------------------------
Snapshotted dentries (and their inodes) are stored in-line as part of the
directory they were in at the time of the snapshot. *All dentries* include a
`first` and `last` snapid for which they are valid. (Non-snapshotted dentries
will have their `last` set to `CEPH_NOSNAP`.)
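
Conceptually, a dentry version is visible to a snapshot only if the snapid
falls inside its validity interval (an illustrative sketch; only the sentinel
value mirrors the real constant)::

   // Sketch: [first, last] validity interval on a dentry. CEPH_NOSNAP
   // marks a live (head) dentry that has no upper bound yet.
   #include <cstdint>

   typedef uint64_t snapid_t;
   const snapid_t CEPH_NOSNAP = (snapid_t)-2;  // Ceph's "no snapshot" sentinel

   struct dentry_span {
     snapid_t first;   // first snapid this dentry version is valid for
     snapid_t last;    // last snapid, or CEPH_NOSNAP while still live

     bool visible_at(snapid_t snap) const {
       return first <= snap && snap <= last;
     }
   };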

Snapshot writeback
------------------
There is a great deal of code to handle writeback efficiently. When a Client
receives an `MClientSnap` message, it updates the local `SnapRealm`
representation and its links to specific `Inodes`, and generates a `CapSnap`
for the `Inode`. The `CapSnap` is flushed out as part of capability writeback,
and if there is dirty data the `CapSnap` is used to block fresh data writes
until the snapshot is completely flushed to the OSDs.

In the MDS, we generate snapshot-representing dentries as part of the regular
process for flushing them. Dentries with outstanding `CapSnap` data are kept
pinned and in the journal.

Deleting snapshots
------------------
Snapshots are deleted by invoking "rmdir" on the ".snap" directory they are
rooted in. (Attempts to delete a directory which roots snapshots *will fail*;
you must delete the snapshots first.) Once deleted, they are entered into the
`OSDMap` list of deleted snapshots and the file data is removed by the OSDs.
Metadata is cleaned up as the directory objects are read in and written back
out again.
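
From a client, deletion mirrors creation (a sketch; the path and snapshot
name are illustrative)::

   // Sketch: removing a snapshot is an rmdir on its .snap entry.
   // File data is then reclaimed asynchronously by the OSDs.
   #include <unistd.h>
   #include <cstdio>

   int main() {
     if (rmdir("/mnt/cephfs/some/dir/.snap/mysnap") != 0)
       perror("rmdir");
     return 0;
   }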

Hard links
----------
Hard links do not interact well with snapshots. A file is snapshotted when its
primary link is part of a SnapRealm; other links *will not* preserve data.
Generally the location where a file was first created will be its primary link,
but if the original link has been deleted it is not easy (nor always
deterministic) to find which link is now the primary.

Multi-FS
--------
Snapshots and multiple filesystems don't interact well. Specifically, each
MDS cluster allocates `snapids` independently; if you have multiple filesystems
sharing a single pool (via namespaces), their snapshots *will* collide and
deleting one will result in missing file data for others. (This may even be
invisible, not throwing errors to the user.) If each FS gets its own
pool things probably work, but this isn't tested and may not be true.