Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | |
2 | SPANNING TREE PROPERTY | |
3 | ||
4 | All metadata that exists in the cache is attached directly or | |
5 | indirectly to the root inode. That is, if the /usr/bin/vi inode is in | |
6 | the cache, then /usr/bin, /usr, and / are too, including the inodes, | |
7 | directory objects, and dentries. | |
8 | ||
9 | ||
10 | AUTHORITY | |
11 | ||
12 | The authority maintains a list of what nodes cache each inode. | |
13 | Additionally, each replica is assigned a nonce (initial 0) to | |
14 | disambiguate multiple replicas of the same item (see below). | |
15 | ||
16 | map<int, int> replicas; // maps replicating mds# to nonce | |
17 | ||
9f95a23c TL |
18 | The cached_by set _always_ includes all nodes that cache a |
19 | particular object, but may additionally include nodes that used to | |
7c673cae FG |
20 | cache it but no longer do. In those cases, an expire message should |
21 | be in transit. That is, we have two invariants: | |
22 | ||
23 | 1) the authority's replica set will always include all actual | |
24 | replicas, and | |
25 | ||
26 | 2) cache expiration notices will be reliably delivered to the | |
27 | authority. | |
28 | ||
29 | The second invariant is particularly important because the presence of | |
30 | replicas will pin the metadata object in memory on the authority, | |
31 | preventing it from being trimmed from the cache. Notification of | |
32 | expiration of the replicas is required to allow previously replicated | |
33 | objects to eventually be trimmed from the cache as well. | |
34 | ||
1e59de90 | 35 | Each metadata object has an authority bit that indicates whether it is |
7c673cae FG |
36 | authoritative or a replica. |
37 | ||
38 | ||
39 | REPLICA NONCE | |
40 | ||
41 | Each replicated object maintains a "nonce" value, issued by the | |
42 | authority at the time the replica was created. If the authority has | |
43 | already created a replica for the given MDS, the new replica will be | |
44 | issued a new (incremented) nonce. This nonce is attached | |
45 | to cache expirations, and allows the authority to disambiguate | |
46 | expirations when multiple replicas of the same object are created and | |
47 | cache expiration is coincident with replication. That is, when an | |
48 | old replica is expired from the replicating MDS at the same time that | |
49 | a new replica is issued by the authority and the resulting messages | |
50 | cross paths, the authority can tell that it was the old replica that | |
51 | was expired and effectively ignore the expiration message. The | |
52 | replica is removed from the replicas map only if the nonce matches. | |
53 | ||
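The nonce check described above can be sketched as follows. This is a
minimal illustration, not the MDS implementation: ReplicaTracker,
add_replica() and handle_expire() are hypothetical names, and only the
replicas map and the initial nonce of 0 come from the text.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for the authority's per-object replica bookkeeping.
struct ReplicaTracker {
  std::map<int, int> replicas;  // maps replicating mds# to nonce

  // Issue a replica to `mds` and return the nonce that travels with it.
  // The first replica gets nonce 0; re-replication increments the nonce.
  int add_replica(int mds) {
    auto it = replicas.find(mds);
    if (it == replicas.end()) {
      replicas[mds] = 0;
      return 0;
    }
    return ++it->second;
  }

  // Process a cache expiration. Returns true if the replica was removed,
  // false if the expiration refers to an older replica and is ignored.
  bool handle_expire(int mds, int nonce) {
    auto it = replicas.find(mds);
    if (it == replicas.end() || it->second != nonce)
      return false;
    replicas.erase(it);
    return true;
  }

  // Any remaining replica keeps the object pinned on the authority.
  bool is_pinned_by_replicas() const { return !replicas.empty(); }
};
```

If an expiration carrying nonce 0 crosses paths with a second replication
(nonce 1), handle_expire() rejects it and the object stays pinned until the
expiration for nonce 1 arrives.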
54 | ||
55 | SUBTREE PARTITION | |
56 | ||
57 | Authority of the file system namespace is partitioned using a | |
58 | subtree-based partitioning strategy. This strategy effectively | |
59 | separates directory inodes from directory contents, such that the | |
60 | directory contents are the unit of redelegation. That is, if / is | |
61 | assigned to mds0 and /usr to mds1, the inode for /usr will be managed | |
62 | by mds0 (it is part of the / directory), while the contents of /usr | |
63 | (and everything nested beneath it) will be managed by mds1. | |
64 | ||
65 | The description for this partition exists solely in the collective | |
66 | memory of the MDS cluster and in the individual MDS journals. It is | |
67 | not described in the regular on-disk metadata structures. This is | |
68 | related to the fact that authority delegation is a property of the | |
69 | {\it directory} and not the directory's {\it inode}. | |
70 | ||
71 | Consequently, if an MDS is authoritative for a directory inode and does | |
72 | not yet have any state associated with the directory in its cache, | |
73 | then it can assume that it is also authoritative for the directory. | |
74 | ||
75 | Directory state consists of a data object that describes any cached | |
76 | dentries contained in the directory, information about the | |
77 | relationship between the cached contents and what appears on disk, and | |
78 | any delegation of authority. That is, each CDir object has a dir_auth | |
79 | element. Normally dir_auth has a value of AUTH_PARENT, meaning that | |
80 | the authority for the directory is the same as the directory's inode. | |
81 | When dir_auth specifies another metadata server, that directory is | |
82 | a point of authority delegation and becomes a {\it subtree root}. A | |
83 | CDir is a subtree root iff its dir_auth specifies an MDS id (and is not | |
84 | AUTH_PARENT). | |
85 | ||
86 | - A dir is a subtree root iff dir_auth != AUTH_PARENT. | |
87 | ||
88 | - If dir_auth == AUTH_PARENT then the inode auth == dir auth, but the | |
89 | converse may not be true. | |
90 | ||
91 | The authority for any metadata object in the cache can be determined | |
92 | by following the parent pointers toward the root until a subtree root | |
93 | CDir object is reached, at which point the authority is specified by | |
94 | its dir_auth. | |
95 | ||
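The parent-pointer walk can be sketched as below. The Inode and Dir types
are simplified, hypothetical stand-ins (dentries are omitted), and the value
chosen for AUTH_PARENT is an assumption. The sketch also assumes the root
directory is always a subtree root, so the walk terminates.

```cpp
#include <cassert>

constexpr int AUTH_PARENT = -1;  // assumed sentinel value, for illustration only

struct Dir;

struct Inode {
  Dir* parent = nullptr;  // directory that contains this inode
};

struct Dir {
  Inode* inode = nullptr;      // the directory's own inode
  int dir_auth = AUTH_PARENT;  // MDS id, or AUTH_PARENT
  bool is_subtree_root() const { return dir_auth != AUTH_PARENT; }
};

// Follow parent pointers toward the root until a subtree root is found.
int authority(const Dir* dir) {
  while (!dir->is_subtree_root()) {
    assert(dir->inode && dir->inode->parent);  // root dir must be a subtree root
    dir = dir->inode->parent;
  }
  return dir->dir_auth;
}

// An inode belongs to the directory that contains it, so its authority is
// that of the containing directory (the root inode is not handled here).
int authority(const Inode* in) {
  assert(in->parent);
  return authority(in->parent);
}
```

With / delegated to mds0 and /usr to mds1, the /usr inode resolves to mds0
while the /usr directory and everything beneath it resolve to mds1.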
96 | Each MDS cache maintains a subtree data structure that describes the | |
97 | subtree partition for all objects currently in the cache: | |
98 | ||
99 | map< CDir*, set<CDir*> > subtrees; | |
100 | ||
101 | - A dir will appear in the subtree map (as a key) IFF it is a subtree | |
102 | root. | |
103 | ||
104 | Each subtree root will have an entry in the map. The map value is a | |
105 | set of all other subtree roots nested beneath that point. Nested | |
106 | subtree roots effectively bound or prune a subtree. For example, if | |
107 | we had the following partition: | |
108 | ||
109 | mds0 / | |
110 | mds1 /usr | |
111 | mds0 /usr/local | |
112 | mds0 /home | |
113 | ||
114 | The subtree map on mds0 would be | |
115 | ||
116 | / -> (/usr, /home) | |
117 | /usr/local -> () | |
118 | /home -> () | |
119 | ||
120 | and on mds1: | |
121 | ||
122 | /usr -> (/usr/local) | |
123 | ||
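The example maps above can be reproduced with a small sketch. It uses path
strings instead of CDir pointers, lists only the subtree roots a given MDS is
authoritative for (as in the example), and records under each root only
those nested roots whose nearest enclosing subtree root it is. All names
here are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using Partition = std::map<std::string, int>;  // subtree root path -> mds#
using SubtreeMap = std::map<std::string, std::set<std::string>>;

// True if `child` lies strictly beneath `parent` in the namespace.
bool is_beneath(const std::string& child, const std::string& parent) {
  if (child == parent) return false;
  if (parent == "/") return child.size() > 1 && child[0] == '/';
  return child.size() > parent.size() &&
         child.compare(0, parent.size(), parent) == 0 &&
         child[parent.size()] == '/';
}

// Nearest subtree root strictly above `path`, or "" if there is none.
std::string enclosing_root(const Partition& p, const std::string& path) {
  std::string best;
  for (const auto& kv : p)
    if (is_beneath(path, kv.first) && kv.first.size() > best.size())
      best = kv.first;
  return best;
}

// Build the subtree map for the roots that `mds` is authoritative for.
SubtreeMap subtree_map_for(const Partition& p, int mds) {
  SubtreeMap out;
  for (const auto& kv : p)
    if (kv.second == mds) out[kv.first];  // create an (initially empty) entry
  for (const auto& kv : p) {
    auto it = out.find(enclosing_root(p, kv.first));
    if (it != out.end()) it->second.insert(kv.first);  // bound of that subtree
  }
  return out;
}
```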
124 | ||
125 | AMBIGUOUS DIR_AUTH | |
126 | ||
127 | While metadata for a subtree is being migrated between two MDS nodes, | |
128 | the dir_auth for the subtree root is allowed to be ambiguous. That | |
129 | is, it will specify both the old and new MDS ids, indicating that a | |
130 | migration is in progress. | |
131 | ||
132 | If a replicated metadata object is expired from the cache from a | |
133 | subtree whose authority is ambiguous, the cache expiration is sent to | |
134 | both potential authorities. This ensures that the message will be | |
135 | reliably delivered, even if either of those nodes fails. A number of | |
136 | alternative strategies were considered. Sending the expiration to the | |
137 | old or new authority and having it forwarded if authority has been | |
138 | delegated can result in message loss if the forwarding node fails. | |
139 | Pinning ambiguous metadata in cache is computationally expensive for | |
140 | implementation reasons, and delaying the transmission of expiration | |
141 | messages is difficult to implement because the replicating MDS must send | |
142 | the final expiration messages when the subtree authority is | |
143 | disambiguated, forcing it to keep certain elements of its cache in | |
144 | memory. Although duplicated expirations incur a small communications | |
145 | overhead, the implementation is much simpler. | |
146 | ||
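A minimal sketch of how a replica might choose expiration recipients under
an ambiguous dir_auth. The pair representation (first = old authority,
second = new authority), the AUTH_UNKNOWN sentinel and the function names
are assumptions made for illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

constexpr int AUTH_UNKNOWN = -2;      // assumed sentinel: no second authority
using DirAuth = std::pair<int, int>;  // assumed layout: (old mds#, new mds#)

// dir_auth is ambiguous while it names two distinct MDS ids.
bool is_ambiguous(const DirAuth& a) {
  return a.second != AUTH_UNKNOWN && a.second != a.first;
}

// MDS ids that must receive a cache expiration for this subtree.
std::vector<int> expire_targets(const DirAuth& a) {
  std::vector<int> targets{a.first};
  if (is_ambiguous(a))
    targets.push_back(a.second);  // duplicate the message during migration
  return targets;
}
```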
147 | ||
148 | AUTH PINS | |
149 | ||
150 | Most operations that modify metadata must allow some amount of time to | |
151 | pass in order for the operation to be journaled or for communication | |
152 | to take place between the object's authority and any replicas. For | |
153 | this reason it must not only be pinned in the authority's metadata | |
154 | cache, but also be locked such that the object's authority is not | |
155 | allowed to change until the operation completes. This is accomplished | |
156 | using {\it auth pins}, which increment a reference counter on the | |
157 | object in question, as well as all parent metadata objects up to the | |
158 | root of the subtree. As long as the pin is in place, it is impossible | |
159 | for that subtree (or any fragment of it that contains one or more | |
160 | pins) to be migrated to a different MDS node. Pins can be placed on | |
161 | both inodes and directories. | |
162 | ||
163 | Auth pins can only exist for authoritative metadata, because they are | |
11fdf7f2 | 164 | only created if the object is authoritative, and their presence |
7c673cae FG |
165 | prevents the migration of authority. |
166 | ||
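The reference counting can be sketched as follows. CacheObject is a
hypothetical, flattened stand-in for inodes and directories, and a single
counter per object is a simplification chosen for the example.

```cpp
#include <cassert>

// Hypothetical cache object; `parent` points one level toward the root.
struct CacheObject {
  CacheObject* parent = nullptr;
  bool is_auth = true;           // authoritative copy or replica
  bool is_subtree_root = false;
  int auth_pins = 0;             // pins on this object or anything beneath it
};

// Pin `obj` and every ancestor up to and including the subtree root.
// Only authoritative objects can be pinned.
bool auth_pin(CacheObject* obj) {
  if (!obj->is_auth) return false;
  for (CacheObject* o = obj; o != nullptr; o = o->parent) {
    ++o->auth_pins;
    if (o->is_subtree_root) break;
  }
  return true;
}

// Release a pin previously taken with auth_pin().
void auth_unpin(CacheObject* obj) {
  for (CacheObject* o = obj; o != nullptr; o = o->parent) {
    assert(o->auth_pins > 0);
    --o->auth_pins;
    if (o->is_subtree_root) break;
  }
}

// A subtree may change authority only when nothing inside it is pinned.
bool can_migrate(const CacheObject* subtree_root) {
  return subtree_root->auth_pins == 0;
}
```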
167 | ||
168 | FREEZING | |
169 | ||
170 | More specifically, auth pins prevent a subtree from being frozen. | |
171 | When a subtree is frozen, all updates to metadata are forbidden. This | |
172 | includes updates to the replicas map that describes which replicas | |
173 | (and nonces) exist for each object. | |
174 | ||
175 | In order for metadata to be migrated between MDS nodes, it must first | |
176 | be frozen. The root of the subtree is initially marked as {\it | |
177 | freezing}. This prevents the creation of any new auth pins within the | |
178 | subtree. After all existing auth pins are removed, the subtree is | |
179 | then marked as {\it frozen}, at which point all updates are | |
180 | forbidden. This allows metadata state to be packaged up in a message | |
181 | and transmitted to the new authority, without worrying about | |
182 | intervening updates. | |
183 | ||
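The freezing sequence amounts to a small state machine. The sketch below
collapses the subtree into a single pin counter; FreezableSubtree and its
member names are hypothetical.

```cpp
#include <cassert>

enum class FreezeState { Unfrozen, Freezing, Frozen };

struct FreezableSubtree {
  int auth_pins = 0;
  FreezeState state = FreezeState::Unfrozen;

  // New auth pins are refused once the subtree is freezing or frozen.
  bool auth_pin() {
    if (state != FreezeState::Unfrozen) return false;
    ++auth_pins;
    return true;
  }

  // Dropping the last pin lets a pending freeze complete.
  void auth_unpin() {
    assert(auth_pins > 0);
    --auth_pins;
    maybe_finish_freeze();
  }

  // Begin freezing; completes immediately if no pins are outstanding.
  void start_freeze() {
    if (state == FreezeState::Unfrozen) state = FreezeState::Freezing;
    maybe_finish_freeze();
  }

  void unfreeze() { state = FreezeState::Unfrozen; }

  // Updates are forbidden only once the subtree is fully frozen.
  bool can_update() const { return state != FreezeState::Frozen; }

 private:
  void maybe_finish_freeze() {
    if (state == FreezeState::Freezing && auth_pins == 0)
      state = FreezeState::Frozen;
  }
};
```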
184 | If the directory at the base of a freezing or frozen subtree is not | |
185 | also a subtree root (that is, it has dir_auth == AUTH_PARENT), the | |
186 | directory's parent inode is auth pinned. | |
187 | ||
188 | - a frozen tree root dir will auth_pin its inode IFF it is auth AND | |
189 | not a subtree root. | |
190 | ||
191 | This prevents a parent directory from being concurrently frozen, and a | |
192 | range of resulting implementation complications relating to metadata | |
193 | migration. | |
194 | ||
195 | ||
196 | CACHE EXPIRATION FOR EXPORTING SUBTREES | |
197 | ||
198 | Cache expiration messages that are received for a subtree that is | |
199 | being exported are either deferred or handled immediately, based on | |
11fdf7f2 | 200 | the sender and receiver states. The importing MDS will always defer until |
7c673cae FG |
201 | after the export finishes, because the import could fail. The exporting MDS |
202 | processes the expire UNLESS the expiring MDS does not know about the export or | |
203 | the exporting MDS is no longer auth. | |
204 | Because MDSes get witness notifications on export, this is safe. Either: | |
205 | a) The expiring MDS knows about the export, and has sent messages to both | |
206 | MDSes involved, or | |
207 | b) The expiring MDS did not know about the export at the time the message | |
208 | was sent, and so only sent it to the exporting MDS. (This implies that the | |
209 | exporting MDS hasn't yet encoded the state to send to the replica MDS.) | |
210 | ||
211 | When the subtree export completes, deferred expirations are either processed | |
212 | (if the MDS is authoritative) or discarded (if it is not). Because either | |
213 | the exporting or importing MDS can fail during the migration | |
214 | process, the MDS cannot tell whether it will be authoritative or not | |
215 | until the process completes. | |
216 | ||
217 | During a migration, the subtree will first be frozen on both the | |
218 | exporter and importer, and then all other replicas will be informed of | |
219 | the subtree's ambiguous authority. This ensures that all expirations | |
220 | during migration will go to both parties, and nothing will be lost in | |
221 | the event of a failure. | |
222 | ||
223 | ||
224 | ||
225 | NORMAL MIGRATION | |
226 | ||
227 | The exporter begins by doing some checks in export_dir() to verify | |
228 | that it is permissible to export the subtree at this time. In | |
229 | particular, the cluster must not be degraded, the subtree root may not | |
230 | be freezing or frozen, and the path must be pinned (i.e., not conflicted | |
231 | with a rename). If these conditions are met, the subtree root | |
232 | directory is temporarily auth pinned, the subtree freeze is initiated, | |
233 | and the exporter is committed to the subtree migration, barring an | |
234 | intervening failure of the importer or itself. | |
235 | ||
236 | The MExportDiscover serves simply to ensure that the inode for the | |
237 | base directory being exported is open on the destination node. It is | |
238 | pinned by the importer to prevent it from being trimmed. This occurs | |
239 | before the exporter completes the freeze of the subtree to ensure that | |
240 | the importer is able to replicate the necessary metadata. When the | |
241 | exporter receives the MDiscoverAck, it allows the freeze to proceed by | |
242 | removing its temporary auth pin. | |
243 | ||
244 | The MExportPrep message then follows to populate the importer with a | |
245 | spanning tree that includes all dirs, inodes, and dentries necessary | |
246 | to reach any nested subtrees within the exported region. This | |
247 | replicates metadata as well, but it is pushed out by the exporter, | |
248 | avoiding deadlock with the regular discover and replication process. | |
249 | The importer is responsible for opening the bounding directories from | |
250 | any third parties authoritative for those subtrees before | |
251 | acknowledging. This ensures that the importer has correct dir_auth | |
252 | information about where authority is redelegated for all points nested | |
253 | beneath the subtree being migrated. While processing the MExportPrep, | |
254 | the importer freezes the entire subtree region to prevent any new | |
255 | replication or cache expiration. | |
256 | ||
257 | A warning stage occurs only if the base subtree directory is open by | |
258 | nodes other than the importer and exporter. If it is not, then this | |
259 | implies that no metadata within or nested beneath the subtree is | |
260 | replicated by any node other than the importer and exporter. If it is, | |
261 | then an MExportWarning message informs any bystanders that the | |
262 | authority for the region is temporarily ambiguous, and lists both the | |
263 | exporter and importer as authoritative MDS nodes. In particular, | |
264 | bystanders who are trimming items from their cache must send | |
265 | MCacheExpire messages to both the old and new authorities. This is | |
266 | necessary to ensure that the surviving authority reliably receives all | |
267 | expirations even if the importer or exporter fails. While the subtree | |
268 | is frozen (on both the importer and exporter), expirations will not be | |
269 | immediately processed; instead, they will be queued until the region | |
270 | is unfrozen and it can be determined that the node is or is not | |
271 | authoritative. | |
272 | ||
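Queueing expirations while the region is frozen might look like the sketch
below. It is deliberately simplified: it ignores the sender and receiver
state checks described earlier and only shows the queue, process and discard
behaviour. MigratingSubtree, Expire and finish() are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <vector>

struct Expire {
  int from_mds;  // replica that trimmed the object
  int nonce;     // nonce of the expired replica
};

struct MigratingSubtree {
  std::map<int, int> replicas;  // maps replicating mds# to nonce
  std::vector<Expire> queued;   // expirations received while frozen
  bool frozen = false;

  // While frozen, authority is undetermined, so the expiration is queued.
  void handle_expire(const Expire& e) {
    if (frozen)
      queued.push_back(e);
    else
      apply(e);
  }

  // Called when the migration resolves: process the queue if this node
  // ended up authoritative, otherwise discard it.
  void finish(bool am_authoritative) {
    frozen = false;
    if (am_authoritative)
      for (const Expire& e : queued) apply(e);
    queued.clear();
  }

 private:
  // Remove the replica only if the nonce matches the current one.
  void apply(const Expire& e) {
    auto it = replicas.find(e.from_mds);
    if (it != replicas.end() && it->second == e.nonce)
      replicas.erase(it);
  }
};
```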
273 | The exporter walks the subtree hierarchy and packages up an MExport | |
274 | message containing all metadata and important state (e.g., information | |
1e59de90 | 275 | about metadata replicas). At the same time, the exporter's metadata |
7c673cae FG |
276 | objects are flagged as non-authoritative. The MExport message sends |
277 | the actual subtree metadata to the importer. Upon receipt, the | |
278 | importer inserts the data into its cache, marks all objects as | |
279 | authoritative, and logs a copy of all metadata in an EImportStart | |
280 | journal message. Once that has safely flushed, it replies with an | |
281 | MExportAck. The exporter can now log an EExport journal entry, which | |
282 | ultimately specifies that the export was a success. In the presence | |
283 | of failures, it is the existence of the EExport entry only that | |
284 | disambiguates authority during recovery. | |
285 | ||
286 | Once logged, the exporter will send an MExportNotify to any | |
287 | bystanders, informing them that the authority is no longer ambiguous | |
288 | and cache expirations should be sent only to the new authority (the | |
289 | importer). Once these are acknowledged back to the exporter, | |
290 | implicitly flushing the bystander to exporter message streams of any | |
291 | stray expiration notices, the exporter unfreezes the subtree, cleans | |
292 | up its migration-related state, and sends a final MExportFinish to the | |
293 | importer. Upon receipt, the importer logs an EImportFinish(true) | |
294 | (noting locally that the export was indeed a success), unfreezes its | |
20effc67 | 295 | subtree, processes any queued cache expirations, and cleans up its |
7c673cae FG |
296 | state. |
297 | ||
298 | ||
299 | PARTIAL FAILURE RECOVERY | |
300 | ||
301 | ||
302 | ||
303 | ||
304 | RECOVERY FROM JOURNAL | |
305 | ||
306 | ||
307 | ||
308 | ||
309 | ||
310 | ||
311 | ||
312 | ||
313 |