Disaster recovery
=================

.. danger::

    The notes in this section are aimed at experts, making a best effort
    to recover what they can from damaged filesystems. These steps
    have the potential to make things worse as well as better. If you
    are unsure, do not proceed.


Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
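
If you do need to fall back to a RADOS-level copy, something along the following
lines may serve as a starting point. It is only a rough sketch: it assumes rank 0
and the default journal inode numbering, under which the rank 0 journal objects in
the metadata pool are named with a ``200.`` prefix. Verify the prefix with
``rados ls`` before relying on it:

::

    # List the rank 0 journal objects (the prefix is an assumption; check it first)
    rados -p <metadata pool> ls | grep '^200\.' > journal_objects.txt
    # Copy each journal object out to a local file
    while read obj; do
        rados -p <metadata pool> get "$obj" "backup_$obj"
    done < journal_objects.txt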


Dentry recovery from journal
----------------------------

If a journal is damaged or for any reason an MDS is incapable of replaying it,
attempt to recover what file metadata we can like so:

::

    cephfs-journal-tool event recover_dentries summary

By default this command acts on MDS rank 0; pass ``--rank=<n>`` to operate on other ranks.
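
For example, to check how much of the journal is readable before attempting
recovery, and to recover dentries from a rank other than 0 (the rank number
below is only a placeholder):

::

    # Report overall journal integrity before recovering anything
    cephfs-journal-tool journal inspect
    # Recover dentries from rank 1 instead of the default rank 0
    cephfs-journal-tool --rank=1 event recover_dentries summary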

``recover_dentries`` will write any inodes/dentries recoverable from the journal
into the backing store, if these inodes/dentries are higher-versioned
than the previous contents of the backing store. If any regions of the journal
are missing/damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update
the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
are now in use. In simple cases, this will result in an entirely valid backing
store state.

.. warning::

    The resulting state of the backing store is not guaranteed to be self-consistent,
    and an online MDS scrub will be required afterwards. The journal contents
    will not be modified by this command; you should truncate the journal
    separately after recovering what you can.
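
Once an MDS is running again, such a scrub can be started through the admin
socket, as shown at the end of this document (``mds.a`` is a placeholder
daemon name):

::

    ceph daemon mds.a scrub_path / recursive repair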

Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool journal reset

.. warning::

    Resetting the journal *will* lose metadata unless you have extracted
    it by other means such as ``recover_dentries``. It is likely to leave
    some orphaned objects in the data pool. It may result in re-allocation
    of already-written inodes, such that permissions rules could be violated.
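
``journal reset`` also acts on rank 0 by default; as with ``recover_dentries``,
pass ``--rank`` to truncate the journal of a different rank (the rank number
below is only a placeholder):

::

    cephfs-journal-tool --rank=1 journal reset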

MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session

This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
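
For example, to inspect and then reset only the snap table of rank 0 (the
``show`` subcommand used here for inspection is an assumption; skip that step
if your version of the tool does not support it):

::

    # Dump the current snap table of rank 0 before wiping it
    cephfs-table-tool 0 show snap
    # Reset only the snap table of rank 0
    cephfs-table-tool 0 reset snap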

MDS map reset
-------------

Once the in-RADOS state of the filesystem (i.e. contents of the metadata pool)
is somewhat recovered, it may be necessary to update the MDS map to reflect
the contents of the metadata pool. Use the following command to reset the MDS
map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored;
as a result, running it can cause data loss.

One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
that it would overwrite any existing root inode on disk and orphan any existing files. In
contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
daemon to claim the rank will go ahead and use the existing in-RADOS metadata.
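
As a quick sanity check after the reset, you can dump the filesystem's MDS map
and confirm that it now describes a single rank:

::

    ceph fs get <fs name>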

Recovery from missing metadata objects
--------------------------------------

Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the
objects.

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool. This is
a three-phase process: first, *all* objects are scanned to calculate
size and mtime metadata for inodes; second, the first object of
every file is scanned to collect this metadata and inject it into
the metadata pool; third, inode linkages are checked and any errors
found are fixed.

::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
    cephfs-data-scan scan_links

The 'scan_extents' and 'scan_inodes' commands may take a *very long* time
if there are many files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within
the range 0-(worker_m - 1).

The example below shows how to run 4 workers simultaneously:

::

    # Worker 0
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    # Worker 0
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
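
One way to enforce that ordering when launching the workers from a single shell
is to background each phase and ``wait`` for it to finish before starting the
next, for example (a sketch assuming a bash-like shell and the 4 workers above):

::

    # scan_extents phase: all four workers in parallel, then wait for all of them
    for n in 0 1 2 3; do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 <data pool> &
    done
    wait
    # scan_inodes phase starts only once every scan_extents worker has finished
    for n in 0 1 2 3; do
        cephfs-data-scan scan_inodes --worker_n $n --worker_m 4 <data pool> &
    done
    wait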

After completing the metadata recovery, you may want to run a cleanup
operation to delete ancillary data generated during recovery.

::

    cephfs-data-scan cleanup <data pool>

Finding files affected by lost data PGs
---------------------------------------

Losing a data PG may affect many files. Files are split into many objects,
so identifying which files are affected by loss of particular PGs requires
a full scan over all object IDs that may exist within the size of a file.
This type of scan may be useful for identifying which files require
restoring from a backup.

.. danger::

    This command does not repair any metadata, so when restoring files in
    this case you must *remove* the damaged file and replace it, so that it
    gets a fresh inode. Do not overwrite damaged files in place.

If you know that objects have been lost from PGs, use the ``pg_files``
subcommand to scan for files that may have been damaged as a result:

::

    cephfs-data-scan pg_files <path> <pg id> [<pg id>...]

For example, if you have lost data from PGs 1.4 and 4.5, and you would like
to know which files under /home/bob might have been damaged:

::

    cephfs-data-scan pg_files /home/bob 1.4 4.5

The output will be a list of paths to potentially damaged files, one
per line.

Note that this command acts as a normal CephFS client to find all the
files in the filesystem and read their layouts, so the MDS must be
up and running.
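
If you are not sure which PGs are affected in the first place, the cluster
health output is one place to look for PGs reporting lost or unfound objects
(a hedged example; the exact wording of the health messages varies between
releases):

::

    ceph health detail | grep -i unfound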

Using an alternate metadata pool for recovery
---------------------------------------------

.. warning::

    There has not been extensive testing of this procedure. It should be
    undertaken with great care.

If an existing filesystem is damaged and inoperative, it is possible to create
a fresh metadata pool and attempt to reconstruct the filesystem metadata
into this new pool, leaving the old metadata in place. This could be used to
make a safer attempt at recovery since the existing metadata pool would not be
overwritten.

.. caution::

    During this process, multiple metadata pools will contain data referring to
    the same data pool. Extreme caution must be exercised to avoid changing the
    data pool contents while this is the case. Once recovery is complete, the
    damaged metadata pool should be deleted.

To begin this process, first create the fresh metadata pool and initialize
it with empty file system data structures:

::

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create recovery <pg-num> replicated <crush-ruleset-name>
    ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
    cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
    ceph fs reset recovery-fs --yes-i-really-mean-it
    cephfs-table-tool recovery-fs:all reset session
    cephfs-table-tool recovery-fs:all reset snap
    cephfs-table-tool recovery-fs:all reset inode

Next, run the recovery toolset using the ``--alternate-pool`` argument to output
results to the alternate pool:

::

    cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original filesystem name> <original data pool name>
    cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original filesystem name> --force-corrupt --force-init <original data pool name>
    cephfs-data-scan scan_links --filesystem recovery-fs

If the damaged filesystem contains dirty journal data, it may be recovered next
with:

::

    cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
    cephfs-journal-tool --rank recovery-fs:0 journal reset --force

After recovery, some recovered directories will have incorrect statistics.
Ensure that the parameters mds_verify_scatter and mds_debug_scatterstat are set
to false (the default) to prevent the MDS from checking the statistics.
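
If you need to confirm or change these settings on a running MDS, the admin
socket can be used (``mds.a`` below is a placeholder daemon name; adjust it to
your daemon):

::

    ceph daemon mds.a config get mds_verify_scatter
    ceph daemon mds.a config set mds_debug_scatterstat false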

Then run a forward scrub to repair the statistics. Ensure you have an MDS
running and issue:

::

    ceph daemon mds.a scrub_path / recursive repair
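
Once you are satisfied with the recovered filesystem, the caution above notes
that the damaged metadata pool should be deleted. One possible sequence is
sketched below; it assumes that no MDS daemons are still serving the damaged
filesystem, and double-check the filesystem and pool names before running it:

::

    # Remove the damaged filesystem so that its metadata pool is no longer in use
    ceph fs rm <original filesystem name> --yes-i-really-mean-it
    # Allow pool deletion on the monitors
    ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
    # Delete the damaged metadata pool
    ceph osd pool delete <original metadata pool> <original metadata pool> --yes-i-really-really-mean-it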