.. _disaster-recovery-experts:

Advanced: Metadata repair tools
===============================

.. warning::

    If you do not have expert knowledge of CephFS internals, you will
    need to seek assistance before using any of these tools.

    The tools mentioned here can easily cause damage as well as fix it.

    It is essential to understand exactly what has gone wrong with your
    file system before attempting to repair it.

    If you do not have access to professional support for your cluster,
    consult the ceph-users mailing list or the #ceph IRC channel.


Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
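
If you need the RADOS-level copy, the following is a minimal sketch. It
assumes the metadata pool is named ``cephfs_metadata`` and that the rank-0
journal objects carry the ``200.`` prefix (the per-rank journal inode);
verify both against your cluster before relying on it.

::

    # Export each rank-0 journal object from the metadata pool to a local file.
    for obj in $(rados -p cephfs_metadata ls | grep '^200\.'); do
        rados -p cephfs_metadata get "$obj" "$obj"
    done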


Dentry recovery from journal
----------------------------

If a journal is damaged, or an MDS is for any other reason incapable of
replaying it, attempt to recover whatever file metadata we can, like so:

::

    cephfs-journal-tool event recover_dentries summary

By default, this command acts on MDS rank 0; pass ``--rank=<n>`` to operate on
other ranks.
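
For example, to recover dentries from the journal of rank 1 (a sketch; recent
versions of the tool expect the rank as ``<fs_name>:<rank>``, so substitute
your file system's name):

::

    cephfs-journal-tool --rank=<fs_name>:1 event recover_dentries summary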

This command will write any inodes/dentries recoverable from the journal
into the backing store, if these inodes/dentries are higher-versioned
than the previous contents of the backing store. If any regions of the journal
are missing/damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update
the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
are now in use. In simple cases, this will result in an entirely valid backing
store state.

.. warning::

    The resulting state of the backing store is not guaranteed to be
    self-consistent, and an online MDS scrub will be required afterwards. The
    journal contents will not be modified by this command; you should truncate
    the journal separately after recovering what you can.

Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool [--rank=N] journal reset

Specify the MDS rank using the ``--rank`` option when the file system has/had
multiple active MDS daemons.
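
For example, to truncate only the journal of rank 1:

::

    cephfs-journal-tool --rank=1 journal reset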

.. warning::

    Resetting the journal *will* lose metadata unless you have extracted
    it by other means such as ``recover_dentries``. It is likely to leave
    some orphaned objects in the data pool. It may result in re-allocation
    of already-written inodes, such that permissions rules could be violated.

MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session

This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
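
For example, to reset only the snap table of rank 0:

::

    cephfs-table-tool 0 reset snap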

MDS map reset
-------------

Once the in-RADOS state of the file system (i.e. the contents of the metadata
pool) is somewhat recovered, it may be necessary to update the MDS map to
reflect the contents of the metadata pool. Use the following command to reset
the MDS map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be
ignored; consequently, this can result in data loss.

One might wonder what the difference is between 'fs reset' and 'fs remove; fs
new'. The key distinction is that doing a remove/new will leave rank 0 in
'creating' state, such that it would overwrite any existing root inode on disk
and orphan any existing files. In contrast, the 'reset' command will leave
rank 0 in 'active' state, such that the next MDS daemon to claim the rank will
go ahead and use the existing in-RADOS metadata.

Recovery from missing metadata objects
--------------------------------------

Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the
objects.

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool. This is
a three-phase process: first, scanning *all* objects to calculate
size and mtime metadata for inodes; second, scanning the first
object from every file to collect this metadata and inject it into
the metadata pool; third, checking inode linkages and fixing any
errors found.

::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
    cephfs-data-scan scan_links

The 'scan_extents' and 'scan_inodes' commands may take a *very long* time
if there are many files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within
the range 0-(worker_m - 1).

The example below shows how to run 4 workers simultaneously:

::

    # Worker 0
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    # Worker 0
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
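
One way to enforce this ordering from a single shell is to background the
workers and ``wait`` between phases, as in this sketch:

::

    # Run all scan_extents workers, then block until every one has finished.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 <data pool> &
    done
    wait

    # Only now is it safe to start the scan_inodes workers.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_inodes --worker_n $n --worker_m 4 <data pool> &
    done
    wait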

After completing the metadata recovery, you may want to run a cleanup
operation to delete the ancillary data generated during recovery.

::

    cephfs-data-scan cleanup <data pool>


Using an alternate metadata pool for recovery
---------------------------------------------

.. warning::

    There has not been extensive testing of this procedure. It should be
    undertaken with great care.

If an existing file system is damaged and inoperative, it is possible to create
a fresh metadata pool and attempt to reconstruct the file system metadata into
this new pool, leaving the old metadata in place. This could be used to make a
safer attempt at recovery since the existing metadata pool would not be
modified.

.. caution::

    During this process, multiple metadata pools will contain data referring to
    the same data pool. Extreme caution must be exercised to avoid changing the
    data pool contents while this is the case. Once recovery is complete, the
    damaged metadata pool should be archived or deleted.

To begin, the existing file system should be taken down, if not done already,
to prevent further modification of the data pool. Unmount all clients and then
mark the file system failed:

::

    ceph fs fail <fs_name>

Next, create a recovery file system in which we will populate a new metadata
pool backed by the original data pool.

::

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create cephfs_recovery_meta
    ceph fs new cephfs_recovery cephfs_recovery_meta <data_pool> --allow-dangerous-metadata-overlay

The recovery file system starts with an MDS rank that will initialize the new
metadata pool with some metadata. This is necessary to bootstrap recovery.
However, now we will take the MDS down as we do not want it interacting with
the metadata pool further.

::

    ceph fs fail cephfs_recovery

Next, we will reset the initial metadata the MDS created:

::

    cephfs-table-tool cephfs_recovery:all reset session
    cephfs-table-tool cephfs_recovery:all reset snap
    cephfs-table-tool cephfs_recovery:all reset inode

Now perform the recovery of the metadata pool from the data pool:

::

    cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta
    cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool>
    cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt <data_pool>
    cephfs-data-scan scan_links --filesystem cephfs_recovery

.. note::

    Each scan procedure above goes through the entire data pool. This may take a
    significant amount of time. See the previous section on how to distribute
    this task among workers.
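
For example, the scan_extents phase can be distributed across 4 workers in
this mode by combining the worker flags with the alternate-pool flags shown
above (a sketch; the scan_inodes phase parallelizes the same way):

::

    for n in 0 1 2 3; do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 \
            --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool> &
    done
    wait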

If the damaged file system contains dirty journal data, it may be recovered next
with:

::

    cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool cephfs_recovery_meta
    cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force
276 | |
277 | After recovery, some recovered directories will have incorrect statistics. | |
f67539c2 TL |
278 | Ensure the parameters ``mds_verify_scatter`` and ``mds_debug_scatterstat`` are |
279 | set to false (the default) to prevent the MDS from checking the statistics: | |
280 | ||
281 | :: | |
282 | ||
283 | ceph config rm mds mds_verify_scatter | |
284 | ceph config rm mds mds_debug_scatterstat | |
285 | ||
286 | (Note, the config may also have been set globally or via a ceph.conf file.) | |
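
To confirm the effective values (a hedged check against the central config
database; settings made only in a local ceph.conf will not show up here):

::

    ceph config get mds mds_verify_scatter
    ceph config get mds mds_debug_scatterstat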

Now, allow an MDS to join the recovery file system:

::

    ceph fs set cephfs_recovery joinable true

Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair the statistics.
Ensure you have an MDS running and issue:

::

    ceph fs status # get active MDS
    ceph tell mds.<id> scrub start / recursive repair

.. note::

    Symbolic links are recovered as empty regular files. `Symbolic link recovery
    <https://tracker.ceph.com/issues/46166>`_ is scheduled to be supported in
    Pacific.

It is recommended to migrate any data from the recovery file system as soon as
possible. Do not restore the old file system while the recovery file system is
operational.
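
A minimal migration sketch, assuming a ceph-fuse client and hypothetical mount
points (``client_fs`` selects the file system to mount; older clients call
this option ``client_mds_namespace``):

::

    # Mount the recovery file system, then copy its contents elsewhere.
    ceph-fuse --client_fs cephfs_recovery /mnt/recovery
    rsync -a /mnt/recovery/ /mnt/destination/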

.. note::

    If the data pool is also corrupt, some files may not be restored because
    backtrace information is lost. If any data objects are missing (due to
    issues like lost Placement Groups on the data pool), the recovered files
    will contain holes in place of the missing data.

.. _Symbolic link recovery: https://tracker.ceph.com/issues/46166