SPANNING TREE PROPERTY

All metadata that exists in the cache is attached directly or
indirectly to the root inode. That is, if the /usr/bin/vi inode is in
the cache, then /usr/bin, /usr, and / are too, including the inodes,
directory objects, and dentries.


AUTHORITY

The authority maintains a list of which nodes cache each inode.
Additionally, each replica is assigned a nonce (initial 0) to
disambiguate multiple replicas of the same item (see below).

  map<int, int> replicas;  // maps replicating mds# to nonce

The cached_by set _always_ includes all nodes that cache a
particular object, but may additionally include nodes that used to
cache it but no longer do. In those cases, an expire message should
be in transit. That is, we have two invariants:

 1) the authority's replica set will always include all actual
    replicas, and

 2) cache expiration notices will be reliably delivered to the
    authority.

The second invariant is particularly important because the presence of
replicas will pin the metadata object in memory on the authority,
preventing it from being trimmed from the cache. Notification of
expiration of the replicas is required to allow previously replicated
objects to eventually be trimmed from the cache as well.

Each metadata object has an authority bit that indicates whether it is
authoritative or a replica.


REPLICA NONCE

Each replicated object maintains a "nonce" value, issued by the
authority at the time the replica was created. If the authority has
already created a replica for the given MDS, the new replica will be
issued a new (incremented) nonce. This nonce is attached
to cache expirations, and allows the authority to disambiguate
expirations when multiple replicas of the same object are created and
cache expiration is coincident with replication. That is, when an
old replica is expired from the replicating MDS at the same time that
a new replica is issued by the authority and the resulting messages
cross paths, the authority can tell that it was the old replica that
was expired and effectively ignore the expiration message. The
replica is removed from the replicas map only if the nonce matches.
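
The nonce check described above can be sketched as follows. This is a
minimal illustration, not the MDS implementation: the ReplicaTracker
type and its add_replica() and handle_expire() methods are hypothetical
names, and the nonce policy (0 for a first replica, incremented on
re-replication) is taken directly from the text.

```cpp
#include <cassert>
#include <map>

// Sketch of the authority-side bookkeeping for one metadata object.
// ReplicaTracker and its methods are hypothetical names.
struct ReplicaTracker {
  std::map<int, int> replicas;  // maps replicating mds# to nonce

  // Issue a replica to `mds` and return the nonce attached to it.
  // A first replica gets nonce 0; re-replication increments the nonce.
  int add_replica(int mds) {
    auto it = replicas.find(mds);
    if (it == replicas.end()) {
      replicas[mds] = 0;
      return 0;
    }
    return ++it->second;
  }

  // Handle a cache expiration carrying `nonce` from `mds`.
  // Returns true if the replica was removed, false if the message
  // referred to an older replica and was ignored.
  bool handle_expire(int mds, int nonce) {
    auto it = replicas.find(mds);
    if (it == replicas.end() || it->second != nonce)
      return false;
    replicas.erase(it);
    return true;
  }
};
```

If an expiration for nonce 0 crosses paths with a second replica
(nonce 1) issued to the same MDS, handle_expire() returns false and the
entry stays in the map until an expiration carrying nonce 1 arrives.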


SUBTREE PARTITION

Authority of the file system namespace is partitioned using a
subtree-based partitioning strategy. This strategy effectively
separates directory inodes from directory contents, such that the
directory contents are the unit of redelegation. That is, if / is
assigned to mds0 and /usr to mds1, the inode for /usr will be managed
by mds0 (it is part of the / directory), while the contents of /usr
(and everything nested beneath it) will be managed by mds1.

The description for this partition exists solely in the collective
memory of the MDS cluster and in the individual MDS journals. It is
not described in the regular on-disk metadata structures. This is
related to the fact that authority delegation is a property of the
_directory_ and not the directory's _inode_.

Consequently, if an MDS is authoritative for a directory inode and does
not yet have any state associated with the directory in its cache,
then it can assume that it is also authoritative for the directory.

Directory state consists of a data object that describes any cached
dentries contained in the directory, information about the
relationship between the cached contents and what appears on disk, and
any delegation of authority. That is, each CDir object has a dir_auth
element. Normally dir_auth has a value of AUTH_PARENT, meaning that
the authority for the directory is the same as that of the directory's
inode. When dir_auth specifies another metadata server, that directory
is a point of authority delegation and becomes a _subtree root_. A
CDir is a subtree root iff its dir_auth specifies an MDS id (and is not
AUTH_PARENT).

 - A dir is a subtree root iff dir_auth != AUTH_PARENT.

 - If dir_auth = AUTH_PARENT then the inode auth == dir auth, but the
   converse may not be true.

The authority for any metadata object in the cache can be determined
by following the parent pointers toward the root until a subtree root
CDir object is reached, at which point the authority is specified by
its dir_auth.
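
As a rough sketch of that walk, assuming a simplified Dir node that
stands in for the real CDir (and collapses the inode and dentry links
into a single parent pointer), the lookup might look like this. The
value chosen for AUTH_PARENT and the authority() helper are
illustrative assumptions.

```cpp
#include <cassert>

constexpr int AUTH_PARENT = -1;  // illustrative sentinel, assumed here

// Simplified stand-in for a cached directory; not the real CDir.
struct Dir {
  Dir* parent = nullptr;       // enclosing directory, nullptr for "/"
  int dir_auth = AUTH_PARENT;  // an MDS id marks a subtree root
};

bool is_subtree_root(const Dir& d) { return d.dir_auth != AUTH_PARENT; }

// Walk toward the root until a subtree root is found and return the
// MDS id stored in its dir_auth.
int authority(const Dir* d) {
  assert(d != nullptr);
  while (!is_subtree_root(*d)) {
    // The root directory must carry an explicit dir_auth.
    assert(d->parent != nullptr);
    d = d->parent;
  }
  return d->dir_auth;
}
```

With / delegated to mds0 and /usr to mds1, a directory such as /usr/bin
with dir_auth == AUTH_PARENT resolves to mds1, because the walk stops
at /usr.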

Each MDS cache maintains a subtree data structure that describes the
subtree partition for all objects currently in the cache:

  map< CDir*, set<CDir*> > subtrees;

 - A dir will appear in the subtree map (as a key) IFF it is a subtree
   root.

Each subtree root will have an entry in the map. The map value is a
set of all other subtree roots nested beneath that point. Nested
subtree roots effectively bound or prune a subtree. For example, if
we had the following partition:

  mds0 /
  mds1 /usr
  mds0 /usr/local
  mds0 /home

The subtree map on mds0 would be

  / -> (/usr, /home)
  /usr/local -> ()
  /home -> ()

and on mds1:

  /usr -> (/usr/local)
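
The example maps above can be reproduced with a small sketch. It keys
the map by path strings rather than CDir pointers purely for
readability, and the subtree_map_for() and is_beneath() helpers are
hypothetical names introduced for illustration. Each subtree root is
attached to its nearest enclosing subtree root, which matches the
example (/usr/local appears under /usr, not under /).

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using SubtreeMap = std::map<std::string, std::set<std::string>>;

// True if path `inner` lies strictly beneath directory `outer`.
bool is_beneath(const std::string& inner, const std::string& outer) {
  if (inner == outer) return false;
  if (outer == "/") return !inner.empty() && inner[0] == '/';
  return inner.size() > outer.size() &&
         inner.compare(0, outer.size(), outer) == 0 &&
         inner[outer.size()] == '/';
}

// partition: (subtree root path, authoritative mds) pairs.
// Returns the subtree map as seen by `mds`: one key per subtree root
// it is authoritative for, mapped to the subtree roots nested
// directly beneath that root.
SubtreeMap subtree_map_for(
    int mds, const std::vector<std::pair<std::string, int>>& partition) {
  SubtreeMap subtrees;
  for (const auto& p : partition)
    if (p.second == mds) subtrees[p.first];  // every owned root is a key
  for (const auto& p : partition) {
    // Nearest enclosing subtree root is the longest ancestor path.
    std::string parent;
    for (const auto& q : partition)
      if (is_beneath(p.first, q.first) && q.first.size() > parent.size())
        parent = q.first;
    auto it = subtrees.find(parent);
    if (it != subtrees.end()) it->second.insert(p.first);
  }
  return subtrees;
}
```

Calling subtree_map_for(0, ...) and subtree_map_for(1, ...) on the
four-entry partition above yields the two maps shown in the example.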


AMBIGUOUS DIR_AUTH

While metadata for a subtree is being migrated between two MDS nodes,
the dir_auth for the subtree root is allowed to be ambiguous. That
is, it will specify both the old and new MDS ids, indicating that a
migration is in progress.

If a replicated metadata object is expired from the cache from a
subtree whose authority is ambiguous, the cache expiration is sent to
both potential authorities. This ensures that the message will be
reliably delivered, even if one of those nodes fails. A number of
alternative strategies were considered. Sending the expiration to the
old or new authority and having it forwarded if authority has been
delegated can result in message loss if the forwarding node fails.
Pinning ambiguous metadata in cache is computationally expensive for
implementation reasons, while delaying the transmission of expiration
messages is difficult to implement because the replicating MDS must
send the final expiration messages when the subtree authority is
disambiguated, forcing it to keep certain elements of its cache in
memory. Although duplicated expirations incur a small communications
overhead, the implementation is much simpler.
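
A minimal sketch of how a replica might pick the recipients of an
expiration under this rule is shown below. It assumes dir_auth is
modeled as a pair of MDS ids, with a hypothetical AUTH_UNKNOWN marker
in the second slot when authority is not ambiguous; the DirAuth alias
and the expire_targets() helper are illustrative names.

```cpp
#include <cassert>
#include <utility>
#include <vector>

constexpr int AUTH_UNKNOWN = -2;  // hypothetical "no second authority" marker

// (current or old authority, new authority or AUTH_UNKNOWN)
using DirAuth = std::pair<int, int>;

bool is_ambiguous(const DirAuth& a) { return a.second != AUTH_UNKNOWN; }

// MDS ids that must receive a cache expiration for an object that
// lives in a subtree with the given dir_auth.
std::vector<int> expire_targets(const DirAuth& a) {
  std::vector<int> targets{a.first};
  if (is_ambiguous(a) && a.second != a.first)
    targets.push_back(a.second);
  return targets;
}
```

During a migration from mds0 to mds1, expire_targets({0, 1}) names both
nodes, so whichever one survives a failure still sees the expiration.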


AUTH PINS

Most operations that modify metadata must allow some amount of time to
pass in order for the operation to be journaled or for communication
to take place between the object's authority and any replicas. For
this reason the object must not only be pinned in the authority's
metadata cache, but also be locked such that the object's authority is
not allowed to change until the operation completes. This is
accomplished using _auth pins_, which increment a reference counter on
the object in question, as well as all parent metadata objects up to
the root of the subtree. As long as the pin is in place, it is
impossible for that subtree (or any fragment of it that contains one
or more pins) to be migrated to a different MDS node. Pins can be
placed on both inodes and directories.

Auth pins can only exist for authoritative metadata, because they are
only created if the object is authoritative, and their presence
prevents the migration of authority.
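
The reference counting described above can be sketched as follows.
This is a simplification under stated assumptions: a single Node type
stands in for both inodes and directories, one counter per node covers
pins on the node and on anything beneath it, and the auth_pin(),
auth_unpin(), and can_migrate() helpers are hypothetical names.

```cpp
#include <cassert>

// Simplified cache node standing in for both inodes and directories.
struct Node {
  Node* parent = nullptr;     // next object toward the root
  bool is_auth = true;        // authoritative (true) or replica (false)
  bool subtree_root = false;  // true for the directory at the subtree root
  int auth_pins = 0;          // pins on this object or anything beneath it
};

// Pin `n` and every ancestor up to and including its subtree root.
// Returns false, changing nothing, if `n` is only a replica.
bool auth_pin(Node* n) {
  if (!n->is_auth) return false;
  for (Node* p = n; p != nullptr; p = p->parent) {
    ++p->auth_pins;
    if (p->subtree_root) break;
  }
  return true;
}

// Undo a previous successful auth_pin(n).
void auth_unpin(Node* n) {
  for (Node* p = n; p != nullptr; p = p->parent) {
    assert(p->auth_pins > 0);
    --p->auth_pins;
    if (p->subtree_root) break;
  }
}

// A subtree can only migrate once nothing inside it is pinned.
bool can_migrate(const Node& root) { return root.auth_pins == 0; }
```

Because the count is propagated to the subtree root, checking whether a
subtree may migrate is a single comparison on the root rather than a
scan of its contents.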


FREEZING

More specifically, auth pins prevent a subtree from being frozen.
When a subtree is frozen, all updates to metadata are forbidden. This
includes updates to the replicas map that describes which replicas
(and nonces) exist for each object.

In order for metadata to be migrated between MDS nodes, it must first
be frozen. The root of the subtree is initially marked as _freezing_.
This prevents the creation of any new auth pins within the subtree.
After all existing auth pins are removed, the subtree is then marked
as _frozen_, at which point all updates are forbidden. This allows
metadata state to be packaged up in a message and transmitted to the
new authority, without worrying about intervening updates.
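
The freezing and frozen states can be viewed as a small state machine
driven by the auth pin count. The sketch below assumes a single counter
for the whole subtree; the Subtree type, the FreezeState enum, and the
method names are illustrative and are not taken from the MDS code.

```cpp
#include <cassert>

enum class FreezeState { Unfrozen, Freezing, Frozen };

struct Subtree {
  FreezeState state = FreezeState::Unfrozen;
  int auth_pins = 0;  // outstanding auth pins anywhere in the subtree

  // New auth pins are refused once a freeze has been requested.
  bool auth_pin() {
    if (state != FreezeState::Unfrozen) return false;
    ++auth_pins;
    return true;
  }

  // Dropping the last pin lets a pending freeze complete.
  void auth_unpin() {
    assert(auth_pins > 0);
    --auth_pins;
    maybe_finish_freeze();
  }

  // Mark the subtree freezing; it becomes frozen once unpinned.
  void freeze() {
    if (state == FreezeState::Unfrozen) state = FreezeState::Freezing;
    maybe_finish_freeze();
  }

  void unfreeze() { state = FreezeState::Unfrozen; }

  // Updates are forbidden only once the subtree is fully frozen.
  bool updates_allowed() const { return state != FreezeState::Frozen; }

 private:
  void maybe_finish_freeze() {
    if (state == FreezeState::Freezing && auth_pins == 0)
      state = FreezeState::Frozen;
  }
};
```

A subtree with one outstanding pin stays in the Freezing state after
freeze() and only moves to Frozen when that pin is released.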

If the directory at the base of a freezing or frozen subtree is not
also a subtree root (that is, it has dir_auth == AUTH_PARENT), the
directory's parent inode is auth pinned.

 - a frozen tree root dir will auth_pin its inode IFF it is auth AND
   not a subtree root.

This prevents a parent directory from being concurrently frozen, and
avoids a range of resulting implementation complications relating to
metadata migration.


CACHE EXPIRATION FOR EXPORTING SUBTREES

Cache expiration messages that are received for a subtree that is
being exported are either deferred or handled immediately, based on
the sender and receiver states. The importing MDS will always defer until
after the export finishes, because the import could fail. The exporting MDS
processes the expire UNLESS the expiring MDS does not know about the export or
the exporting MDS is no longer auth.
Because MDSes get witness notifications on export, this is safe. Either:
a) The expiring MDS knows about the export, and has sent messages to both
MDSes involved, or
b) The expiring MDS did not know about the export at the time the message
was sent, and so only sent it to the exporting MDS. (This implies that the
exporting MDS hasn't yet encoded the state to send to the replica MDS.)

When the subtree export completes, deferred expirations are either processed
(if the MDS is authoritative) or discarded (if it is not). Because either
the exporting or importing MDS can fail during the migration
process, the MDS cannot tell whether it will be authoritative or not
until the process completes.
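
The process-or-discard step at the end of an export can be sketched
with a simple queue. The Expire and DeferredExpires types and the
finish() method are hypothetical names used only for illustration; the
sketch covers what happens to deferred expirations once the outcome is
known, not the decision of when to defer.

```cpp
#include <cassert>
#include <vector>

struct Expire {
  int from_mds;  // MDS that trimmed its replica
  int nonce;     // replica nonce attached to the expiration
};

// Expirations held back for one subtree while its migration is in flight.
struct DeferredExpires {
  std::vector<Expire> queued;

  void defer(const Expire& e) { queued.push_back(e); }

  // Called once the migration outcome is known. Returns the
  // expirations that should now be processed: all of them if this MDS
  // ended up authoritative, none (they are discarded) otherwise.
  std::vector<Expire> finish(bool is_auth) {
    std::vector<Expire> to_process;
    if (is_auth) to_process.swap(queued);
    queued.clear();
    return to_process;
  }
};
```

Since each expiration was sent to both parties, discarding the queue on
the node that did not end up authoritative loses nothing.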

During a migration, the subtree will first be frozen on both the
exporter and importer, and then all other replicas will be informed of
the subtree's ambiguous authority. This ensures that all expirations
during migration will go to both parties, and nothing will be lost in
the event of a failure.



NORMAL MIGRATION

The exporter begins by doing some checks in export_dir() to verify
that it is permissible to export the subtree at this time. In
particular, the cluster must not be degraded, the subtree root may not
be freezing or frozen, and the path must be pinned (i.e., not conflicted
with a rename). If these conditions are met, the subtree root
directory is temporarily auth pinned, the subtree freeze is initiated,
and the exporter is committed to the subtree migration, barring an
intervening failure of the importer or itself.

The MExportDiscover message serves simply to ensure that the inode for
the base directory being exported is open on the destination node. It
is pinned by the importer to prevent it from being trimmed. This occurs
before the exporter completes the freeze of the subtree to ensure that
the importer is able to replicate the necessary metadata. When the
exporter receives the MDiscoverAck, it allows the freeze to proceed by
removing its temporary auth pin.

The MExportPrep message then follows to populate the importer with a
spanning tree that includes all dirs, inodes, and dentries necessary
to reach any nested subtrees within the exported region. This
replicates metadata as well, but it is pushed out by the exporter,
avoiding deadlock with the regular discover and replication process.
The importer is responsible for opening the bounding directories from
any third parties authoritative for those subtrees before
acknowledging. This ensures that the importer has correct dir_auth
information about where authority is redelegated for all points nested
beneath the subtree being migrated. While processing the MExportPrep,
the importer freezes the entire subtree region to prevent any new
replication or cache expiration.

A warning stage occurs only if the base subtree directory is open by
nodes other than the importer and exporter. If it is not, then this
implies that no metadata within or nested beneath the subtree is
replicated by any node other than the importer and exporter. If it is,
then an MExportWarning message informs any bystanders that the
authority for the region is temporarily ambiguous, and lists both the
exporter and importer as authoritative MDS nodes. In particular,
bystanders who are trimming items from their cache must send
MCacheExpire messages to both the old and new authorities. This is
necessary to ensure that the surviving authority reliably receives all
expirations even if the importer or exporter fails. While the subtree
is frozen (on both the importer and exporter), expirations will not be
immediately processed; instead, they will be queued until the region
is unfrozen and it can be determined that the node is or is not
authoritative.

The exporter walks the subtree hierarchy and packages up an MExport
message containing all metadata and important state (e.g., information
about metadata replicas). At the same time, the exporter's metadata
objects are flagged as non-authoritative. The MExport message sends
the actual subtree metadata to the importer. Upon receipt, the
importer inserts the data into its cache, marks all objects as
authoritative, and logs a copy of all metadata in an EImportStart
journal message. Once that has safely flushed, it replies with an
MExportAck. The exporter can now log an EExport journal entry, which
ultimately specifies that the export was a success. In the presence
of failures, it is only the existence of the EExport entry that
disambiguates authority during recovery.

Once logged, the exporter will send an MExportNotify to any
bystanders, informing them that the authority is no longer ambiguous
and cache expirations should be sent only to the new authority (the
importer). Once these are acknowledged back to the exporter,
implicitly flushing the bystander-to-exporter message streams of any
stray expiration notices, the exporter unfreezes the subtree, cleans
up its migration-related state, and sends a final MExportFinish to the
importer. Upon receipt, the importer logs an EImportFinish(true)
(noting locally that the export was indeed a success), unfreezes its
subtree, processes any queued cache expirations, and cleans up its
state.
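
As a compact summary of the exchange described in this section, the
named messages can be listed in order. The sketch below is only a
reading aid: the Step type and migration_messages() function are
hypothetical, the message names are those used in the text, and
acknowledgements that the text does not name (for MExportPrep and
MExportNotify) are omitted.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Step {
  std::string sender;   // "exporter" or "importer"
  std::string message;  // message name as used in the text
};

// Ordered list of the named messages in a successful migration.
// The warning and notify stages only occur when bystanders exist.
std::vector<Step> migration_messages(bool has_bystanders) {
  std::vector<Step> steps = {
      {"exporter", "MExportDiscover"},
      {"importer", "MDiscoverAck"},
      {"exporter", "MExportPrep"},
  };
  if (has_bystanders) steps.push_back({"exporter", "MExportWarning"});
  steps.push_back({"exporter", "MExport"});
  steps.push_back({"importer", "MExportAck"});
  if (has_bystanders) steps.push_back({"exporter", "MExportNotify"});
  steps.push_back({"exporter", "MExportFinish"});
  return steps;
}
```

The journal entries fall between these messages: EImportStart is
logged by the importer before MExportAck, EExport by the exporter
after it, and EImportFinish(true) by the importer on receipt of
MExportFinish.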


PARTIAL FAILURE RECOVERY




RECOVERY FROM JOURNAL
