.. _disaster-recovery-experts:

Advanced: Metadata repair tools
===============================

.. warning::

    If you do not have expert knowledge of CephFS internals, you will
    need to seek assistance before using any of these tools.

    The tools mentioned here can easily cause damage as well as fixing it.

    It is essential to understand exactly what has gone wrong with your
    file system before attempting to repair it.

    If you do not have access to professional support for your cluster,
    consult the ceph-users mailing list or the #ceph IRC channel.

Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).

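If ``cephfs-journal-tool`` cannot read the journal at all, the raw journal
objects can be copied out of the metadata pool with the ``rados`` tool instead.
The following is only a sketch of one way to do this; it assumes the metadata
pool is named ``cephfs_metadata`` and that you want rank 0's journal, whose
object names begin with the journal inode number for rank 0 (``200.``):

::

    # Copy every rank-0 journal object (names beginning with "200.") out of
    # the metadata pool into local files as a raw backup.
    rados -p cephfs_metadata ls | grep '^200\.' | while read obj; do
        rados -p cephfs_metadata get "$obj" "journal_backup.$obj"
    done
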
Dentry recovery from journal
----------------------------

If a journal is damaged or for any reason an MDS is incapable of replaying it,
attempt to recover whatever file metadata you can like so:

::

    cephfs-journal-tool event recover_dentries summary

By default, this command acts on MDS rank 0; pass ``--rank=<n>`` to operate on
other ranks.

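For example, to run the same recovery against a hypothetical second active
rank (rank 1), the ``<filesystem name>:<rank>`` form of ``--rank`` used later
in this document also works; ``<fs_name>`` is a placeholder for your file
system's name:

::

    cephfs-journal-tool --rank=<fs_name>:1 event recover_dentries summary
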
This command will write any inodes/dentries recoverable from the journal
into the backing store, if these inodes/dentries are higher-versioned
than the previous contents of the backing store. If any regions of the journal
are missing or damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update
the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
are now in use. In simple cases, this will result in an entirely valid backing
store state.

.. warning::

    The resulting state of the backing store is not guaranteed to be self-consistent,
    and an online MDS scrub will be required afterwards. The journal contents
    will not be modified by this command; you should truncate the journal
    separately after recovering what you can.

Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool [--rank=N] journal reset

Specify the MDS rank using the ``--rank`` option when the file system has, or
had, multiple active MDS daemons.

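For instance, truncating the journal of rank 0 on a multi-rank file system
might look like the following; ``<fs_name>`` is a placeholder for your file
system's name, using the ``<filesystem name>:<rank>`` form of ``--rank`` that
appears later in this document:

::

    cephfs-journal-tool --rank=<fs_name>:0 journal reset
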
.. warning::

    Resetting the journal *will* lose metadata unless you have extracted
    it by other means such as ``recover_dentries``. It is likely to leave
    some orphaned objects in the data pool. It may result in re-allocation
    of already-written inodes, such that permissions rules could be violated.

MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session

This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.

MDS map reset
-------------

Once the in-RADOS state of the file system (i.e. contents of the metadata pool)
is somewhat recovered, it may be necessary to update the MDS map to reflect
the contents of the metadata pool. Use the following command to reset the MDS
map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored:
as a result, this can cause data loss.

One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
that it would overwrite any existing root inode on disk and orphan any existing files. In
contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
daemon to claim the rank will go ahead and use the existing in-RADOS metadata.

Recovery from missing metadata objects
--------------------------------------

Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the
objects.

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool. This is
a three-phase process. First, all objects are scanned to calculate
size and mtime metadata for inodes. Second, the first object of
every file is scanned to collect this metadata and inject it into
the metadata pool. Third, inode linkages are checked and any
errors found are fixed.

::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
    cephfs-data-scan scan_links

The ``scan_extents`` and ``scan_inodes`` commands may take a *very long* time
if there are many files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within
the range ``0`` to ``worker_m - 1``.

The example below shows how to run 4 workers simultaneously:

::

    # Worker 0
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    # Worker 0
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.

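One way to drive the workers is a small shell loop that backgrounds each
instance and waits for the whole phase to finish before moving on. This is
only a sketch; it assumes a bash shell, four workers, and a data pool named
``cephfs_data``:

::

    # Phase 1: scan_extents with 4 parallel workers
    for n in 0 1 2 3; do
        cephfs-data-scan scan_extents --worker_n $n --worker_m 4 cephfs_data &
    done
    wait   # all scan_extents workers must finish before scan_inodes starts

    # Phase 2: scan_inodes with 4 parallel workers
    for n in 0 1 2 3; do
        cephfs-data-scan scan_inodes --worker_n $n --worker_m 4 cephfs_data &
    done
    wait
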
After completing the metadata recovery, you may want to run a cleanup
operation to delete the ancillary data generated during recovery.

::

    cephfs-data-scan cleanup <data pool>


Using an alternate metadata pool for recovery
---------------------------------------------

.. warning::

    There has not been extensive testing of this procedure. It should be
    undertaken with great care.

If an existing file system is damaged and inoperative, it is possible to create
a fresh metadata pool and attempt to reconstruct the file system metadata into
this new pool, leaving the old metadata in place. This could be used to make a
safer attempt at recovery since the existing metadata pool would not be
modified.

.. caution::

    During this process, multiple metadata pools will contain data referring to
    the same data pool. Extreme caution must be exercised to avoid changing the
    data pool contents while this is the case. Once recovery is complete, the
    damaged metadata pool should be archived or deleted.

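One possible way to archive the damaged metadata pool once recovery is
complete is to serialize its contents to a file with the ``rados`` export
facility before removing it. This is only a sketch; it assumes the damaged
file system's metadata pool is named ``cephfs_metadata``:

::

    # Serialize all objects in the damaged metadata pool to a local archive file.
    rados -p cephfs_metadata export cephfs_metadata.archive
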
To begin, the existing file system should be taken down, if not done already,
to prevent further modification of the data pool. Unmount all clients and then
mark the file system failed:

::

    ceph fs fail <fs_name>

Next, create a recovery file system in which we will populate a new metadata pool
backed by the original data pool.

::

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create cephfs_recovery_meta
    ceph fs new cephfs_recovery cephfs_recovery_meta <data_pool> --allow-dangerous-metadata-overlay

The recovery file system starts with an MDS rank that will initialize the new
metadata pool with some metadata. This is necessary to bootstrap recovery.
However, now we will take the MDS down, as we do not want it interacting with
the metadata pool any further.

::

    ceph fs fail cephfs_recovery

Next, we will reset the initial metadata the MDS created:

::

    cephfs-table-tool cephfs_recovery:all reset session
    cephfs-table-tool cephfs_recovery:all reset snap
    cephfs-table-tool cephfs_recovery:all reset inode

Now perform the recovery of the metadata pool from the data pool:

::

    cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta
    cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool>
    cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt <data_pool>
    cephfs-data-scan scan_links --filesystem cephfs_recovery

.. note::

    Each scan procedure above goes through the entire data pool. This may take a
    significant amount of time. See the previous section on how to distribute
    this task among workers.

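For example, worker 0 of 4 for the ``scan_extents`` phase against the
alternate pool might be invoked as shown below, combining the worker options
from the previous section with the options above; the remaining workers and
the ``scan_inodes`` phase follow the same pattern:

::

    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool>
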
If the damaged file system contains dirty journal data, it may be recovered next
with:

::

    cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool cephfs_recovery_meta
    cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force

After recovery, some recovered directories will have incorrect statistics.
Ensure the parameters ``mds_verify_scatter`` and ``mds_debug_scatterstat`` are
set to false (the default) to prevent the MDS from checking the statistics:

::

    ceph config rm mds mds_verify_scatter
    ceph config rm mds mds_debug_scatterstat

(Note that the config may also have been set globally or via a ``ceph.conf`` file.)

Now, allow an MDS to join the recovery file system:

::

    ceph fs set cephfs_recovery joinable true

Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair the statistics.
Ensure you have an MDS running and issue:

::

    ceph fs status # get active MDS
    ceph tell mds.<id> scrub start / recursive repair

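To follow the progress of the scrub, the scrub status command can be queried
on the same MDS (the ``<id>`` placeholder is the same active MDS as above):

::

    ceph tell mds.<id> scrub status
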
.. note::

    Symbolic links are recovered as empty regular files. `Symbolic link recovery
    <https://tracker.ceph.com/issues/46166>`_ is scheduled to be supported in
    Pacific.

It is recommended to migrate any data from the recovery file system as soon as
possible. Do not restore the old file system while the recovery file system is
operational.

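One way to migrate the data is to mount the recovery file system and copy its
contents to other storage with ordinary tools. This is only a sketch; it
assumes a ``ceph-fuse`` client that supports selecting a file system by name
via the ``client_fs`` option (older clients use ``client_mds_namespace``
instead) and a destination already mounted at ``/mnt/destination``:

::

    # Mount the recovery file system and copy its contents to another location.
    mkdir -p /mnt/recovery
    ceph-fuse --client_fs cephfs_recovery /mnt/recovery
    rsync -a /mnt/recovery/ /mnt/destination/
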
.. note::

    If the data pool is also corrupt, some files may not be restored because
    backtrace information is lost. If any data objects are missing (due to
    issues like lost Placement Groups on the data pool), the recovered files
    will contain holes in place of the missing data.

.. _Symbolic link recovery: https://tracker.ceph.com/issues/46166