.. _disaster-recovery-experts:

Advanced: Metadata repair tools
===============================

.. warning::

    If you do not have expert knowledge of CephFS internals, you will
    need to seek assistance before using any of these tools.

    The tools mentioned here can easily cause damage as well as fix it.

    It is essential to understand exactly what has gone wrong with your
    file system before attempting to repair it.

    If you do not have access to professional support for your cluster,
    consult the ceph-users mailing list or the #ceph IRC channel.


Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
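
One way to make such a RADOS-level copy is to export the journal objects
directly from the metadata pool with the ``rados`` tool. This is only a
minimal sketch, assuming rank 0 (whose journal objects are named
``200.<offset>``) and a metadata pool named ``cephfs_metadata``; adjust both
to match your cluster:

::

    # Copy each rank-0 journal object out of the metadata pool into the
    # current directory (one file per object).
    for obj in $(rados -p cephfs_metadata ls | grep '^200\.'); do
        rados -p cephfs_metadata get "$obj" "$obj"
    done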


Dentry recovery from journal
----------------------------

If a journal is damaged or for any reason an MDS is incapable of replaying it,
attempt to recover what file metadata you can like so:

::

    cephfs-journal-tool event recover_dentries summary

This command acts on MDS rank 0 by default; pass ``--rank=<fs_name>:<rank>`` to operate on other ranks.
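
For example, to recover dentries from the journal of rank 1 (assuming a
hypothetical file system named ``mycephfs``), the invocation would look like:

::

    cephfs-journal-tool --rank=mycephfs:1 event recover_dentries summary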

This command will write any inodes/dentries recoverable from the journal
into the backing store, if these inodes/dentries are higher-versioned
than the previous contents of the backing store. If any regions of the journal
are missing/damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update
the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
are now in use. In simple cases, this will result in an entirely valid backing
store state.

.. warning::

    The resulting state of the backing store is not guaranteed to be self-consistent,
    and an online MDS scrub will be required afterwards. The journal contents
    will not be modified by this command; you should truncate the journal
    separately after recovering what you can.

Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool [--rank=<fs_name>:<rank>] journal reset

Specify the MDS rank using the ``--rank`` option when the file system has/had
multiple active MDS daemons.

.. warning::

    Resetting the journal *will* lose metadata unless you have extracted
    it by other means such as ``recover_dentries``. It is likely to leave
    some orphaned objects in the data pool. It may result in re-allocation
    of already-written inodes, such that permissions rules could be violated.

MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session

This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
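
For example, to reset only rank 0's session table, or to reset the snap table
across all ranks:

::

    cephfs-table-tool 0 reset session
    cephfs-table-tool all reset snap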

MDS map reset
-------------

Once the in-RADOS state of the file system (i.e. contents of the metadata pool)
is somewhat recovered, it may be necessary to update the MDS map to reflect
the contents of the metadata pool. Use the following command to reset the MDS
map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored;
consequently, running this command can result in data loss.

One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
that it would overwrite any existing root inode on disk and orphan any existing files. In
contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
daemon to claim the rank will go ahead and use the existing in-RADOS metadata.

Recovery from missing metadata objects
--------------------------------------

Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the
objects.

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=<fs_name>:0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool. This is
a three-phase process: first, scan *all* objects to calculate
size and mtime metadata for inodes; second, scan the first
object from every file to collect this metadata and inject it into
the metadata pool; third, check inode linkages and fix any
errors found.

::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
    cephfs-data-scan scan_links

The 'scan_extents' and 'scan_inodes' commands may take a *very long* time
if there are many files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within
the range 0-(worker_m - 1).

The example below shows how to run 4 workers simultaneously:

::

    # Worker 0
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    # Worker 0
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
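
One way to run the workers concurrently while still respecting that ordering
is with shell background jobs, waiting for every ``scan_extents`` worker to
finish before launching ``scan_inodes``. This is only a sketch (4 workers,
with ``<data pool>`` as the placeholder pool name):

::

    # Launch all scan_extents workers in the background, then wait for
    # every one of them to complete before starting the next phase.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 <data pool> &
    done
    wait

    # Only now is it safe to begin the scan_inodes phase.
    for n in 0 1 2 3; do
        cephfs-data-scan scan_inodes --worker_n $n --worker_m 4 <data pool> &
    done
    wait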

After completing the metadata recovery, you may want to run a cleanup
operation to delete ancillary data generated during recovery.

::

    cephfs-data-scan cleanup <data pool>



Using an alternate metadata pool for recovery
---------------------------------------------

.. warning::

    There has not been extensive testing of this procedure. It should be
    undertaken with great care.

If an existing file system is damaged and inoperative, it is possible to create
a fresh metadata pool and attempt to reconstruct the file system metadata into
this new pool, leaving the old metadata in place. This could be used to make a
safer attempt at recovery since the existing metadata pool would not be
modified.

.. caution::

    During this process, multiple metadata pools will contain data referring to
    the same data pool. Extreme caution must be exercised to avoid changing the
    data pool contents while this is the case. Once recovery is complete, the
    damaged metadata pool should be archived or deleted.

To begin, the existing file system should be taken down, if not done already,
to prevent further modification of the data pool. Unmount all clients and then
mark the file system failed:

::

    ceph fs fail <fs_name>
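
The client unmount step depends on how each client was mounted. As a rough
sketch, assuming a hypothetical mount point ``/mnt/cephfs`` on each client
host:

::

    # Kernel client mount
    umount /mnt/cephfs
    # ceph-fuse mount
    fusermount -u /mnt/cephfs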

Next, create a recovery file system in which we will populate a new metadata pool
backed by the original data pool.

::

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create cephfs_recovery_meta
    ceph fs new cephfs_recovery cephfs_recovery_meta <data_pool> --allow-dangerous-metadata-overlay


The recovery file system starts with an MDS rank that will initialize the new
metadata pool with some metadata. This is necessary to bootstrap recovery.
However, now we will take the MDS down as we do not want it interacting with
the metadata pool further.

::

    ceph fs fail cephfs_recovery

Next, we will reset the initial metadata the MDS created:

::

    cephfs-table-tool cephfs_recovery:all reset session
    cephfs-table-tool cephfs_recovery:all reset snap
    cephfs-table-tool cephfs_recovery:all reset inode

Now perform the recovery of the metadata pool from the data pool:

::

    cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta
    cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool>
    cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt <data_pool>
    cephfs-data-scan scan_links --filesystem cephfs_recovery

.. note::

    Each scan procedure above goes through the entire data pool. This may take a
    significant amount of time. See the previous section on how to distribute
    this task among workers.

If the damaged file system contains dirty journal data, it may be recovered next
with:

::

    cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool cephfs_recovery_meta
    cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force

After recovery, some recovered directories will have incorrect statistics.
Ensure the parameters ``mds_verify_scatter`` and ``mds_debug_scatterstat`` are
set to false (the default) to prevent the MDS from checking the statistics:

::

    ceph config rm mds mds_verify_scatter
    ceph config rm mds mds_debug_scatterstat

(Note that the config may also have been set globally or via a ``ceph.conf`` file.)
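
If the options were set globally rather than for the ``mds`` section, the
corresponding removals would be (a sketch; adjust the target to wherever the
options were actually set):

::

    ceph config rm global mds_verify_scatter
    ceph config rm global mds_debug_scatterstat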

Now, allow an MDS to join the recovery file system:

::

    ceph fs set cephfs_recovery joinable true

Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair the statistics.
Ensure you have an MDS running and issue:

::

    ceph fs status # get active MDS
    ceph tell mds.<id> scrub start / recursive repair
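
To monitor the scrub's progress, you can query its status on the same MDS
(assuming the same ``<id>`` as above):

::

    ceph tell mds.<id> scrub status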

.. note::

    Symbolic links are recovered as empty regular files. `Symbolic link recovery
    <https://tracker.ceph.com/issues/46166>`_ is scheduled to be supported in
    Pacific.

It is recommended to migrate any data from the recovery file system as soon as
possible. Do not restore the old file system while the recovery file system is
operational.

.. note::

    If the data pool is also corrupt, some files may not be restored because
    backtrace information is lost. If any data objects are missing (due to
    issues like lost Placement Groups on the data pool), the recovered files
    will contain holes in place of the missing data.