The notes in this section are aimed at experts making a best effort
to recover what they can from damaged filesystems. These steps
have the potential to make things worse as well as better. If you
are unsure, do not proceed.
Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin
Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made
(http://tracker.ceph.com/issues/9902).
Dentry recovery from journal
----------------------------
If a journal is damaged or for any reason an MDS is incapable of replaying it,
attempt to recover what file metadata we can like so:

::

    cephfs-journal-tool event recover_dentries summary
This command by default acts on MDS rank 0; pass ``--rank=<n>`` to operate on
other ranks.
This command will write any inodes/dentries recoverable from the journal
into the backing store, if these inodes/dentries are higher-versioned
than the previous contents of the backing store. If any regions of the journal
are missing or damaged, they will be skipped.
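The "higher-versioned wins" rule described above can be sketched as follows.
This is an illustrative model only, not the MDS's actual data structures; the
dict layout and names are hypothetical.

```python
# Illustrative sketch of the "higher-versioned wins" rule; the dict
# layout and names here are hypothetical, not real MDS structures.

def apply_journal(backing, journal):
    """backing, journal: dicts mapping dentry name -> (version, payload)."""
    for name, (version, payload) in journal.items():
        # Only write entries that are newer than what the backing store holds.
        if name not in backing or version > backing[name][0]:
            backing[name] = (version, payload)
    return backing

store = {"fileA": (3, "old"), "fileB": (7, "current")}
events = {"fileA": (5, "newer"), "fileB": (6, "stale"), "fileC": (1, "new")}
apply_journal(store, events)
# "fileB" keeps its higher-versioned backing-store copy; the others are written.
```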
Note that in addition to writing out dentries and inodes, this command will update
the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
are now in use. In simple cases, this will result in an entirely valid backing
store state.
The resulting state of the backing store is not guaranteed to be self-consistent,
and an online MDS scrub will be required afterwards. The journal contents
will not be modified by this command; you should truncate the journal
separately after recovering what you can.
Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool journal reset
Resetting the journal *will* lose metadata unless you have extracted
it by other means such as ``recover_dentries``. It is likely to leave
some orphaned objects in the data pool. It may result in re-allocation
of already-written inodes, such that permissions rules could be violated.
MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).
To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session
This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
rank to operate on that rank only.
The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
MDS map reset
-------------

Once the in-RADOS state of the filesystem (i.e. contents of the metadata pool)
is somewhat recovered, it may be necessary to update the MDS map to reflect
the contents of the metadata pool. Use the following command to reset the MDS
map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it
Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored;
as a result, data loss is possible.
One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
that it would overwrite any existing root inode on disk and orphan any existing files. In
contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
daemon to claim the rank will go ahead and use the existing in-RADOS metadata.
Recovery from missing metadata objects
--------------------------------------
Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the missing
objects:

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init
Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool. This is
a two-phase process. First, scanning *all* objects to calculate
size and mtime metadata for inodes. Second, scanning the first
object from every file to collect this metadata and inject
it into the metadata pool.
::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
This command may take a *very long* time if there are many
files or very large files in the data pool.
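The first phase can be pictured with the following sketch. It is a conceptual
model of what scan_extents derives, under assumed defaults (4 MiB objects, no
custom striping), and is not the tool's actual code: data objects are named
``<inode hex>.<index hex>``, so the highest-indexed object bounds the file's
size, and the newest object mtime bounds its mtime.

```python
# Conceptual model of the scan_extents phase (assumed 4 MiB default
# object size; hypothetical function name, not the tool's real code).

OBJECT_SIZE = 4 * 1024 * 1024  # default CephFS object size

def recover_size_mtime(objects):
    """objects: dict of object name "<ino hex>.<index hex>" -> (length, mtime).

    Returns {inode: (recovered size lower bound, latest mtime)}.
    """
    results = {}
    for name, (length, mtime) in objects.items():
        ino_hex, idx_hex = name.split(".")
        ino, index = int(ino_hex, 16), int(idx_hex, 16)
        # An object at index i implies the file extends at least to
        # i * OBJECT_SIZE + len(object) bytes.
        size_bound = index * OBJECT_SIZE + length
        old_size, old_mtime = results.get(ino, (0, 0))
        results[ino] = (max(old_size, size_bound), max(old_mtime, mtime))
    return results

scan = recover_size_mtime({
    "10000000000.00000000": (OBJECT_SIZE, 100),  # first, full object
    "10000000000.00000001": (1234, 200),         # last, partial object
})
# The file's size is recovered as 1 * 4 MiB + 1234 bytes.
```

This is also why every object must be visited: skipping any object could miss
the one that defines the true end of a file.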
To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within
the range 0 to (worker_m - 1).

The example below shows how to run 4 workers simultaneously:
::

    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>
It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
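The sharding contract behind ``--worker_n``/``--worker_m`` can be illustrated
as follows. This is a conceptual sketch with a made-up hash function, not
cephfs-data-scan's real internals: each worker handles a disjoint slice of the
objects, and together the slices cover every object.

```python
# Conceptual illustration of worker sharding (the hash function and
# names are made up for illustration, not the tool's real internals).

def name_hash(name):
    # Simple deterministic string hash, for illustration only.
    h = 0
    for c in name:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

def shard(objects, worker_n, worker_m):
    """Return the subset of objects that worker worker_n (of worker_m) scans."""
    return [o for o in objects if name_hash(o) % worker_m == worker_n]

# 8 inodes x 4 objects each, named "<ino hex>.<index hex>".
objects = ["%x.%08x" % (0x10000000000 + i, j)
           for i in range(8) for j in range(4)]
shards = [shard(objects, n, 4) for n in range(4)]

# The shards are disjoint and cover all objects: nothing is scanned
# twice and nothing is skipped. Because one inode's objects may land in
# different workers' shards, every worker must finish scan_extents
# before any worker starts scan_inodes.
assert sorted(o for s in shards for o in s) == sorted(objects)
```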
After completing the metadata recovery, you may want to run a cleanup
operation to delete ancillary data generated during recovery:

::

    cephfs-data-scan cleanup <data pool>
Finding files affected by lost data PGs
---------------------------------------
Losing a data PG may affect many files. Files are split into many objects,
so identifying which files are affected by loss of particular PGs requires
a full scan over all object IDs that may exist within the size of a file.
This type of scan may be useful for identifying which files require
restoring from a backup.
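The scope of that scan follows from how a file's data maps onto objects. A
sketch under assumed defaults (4 MiB objects, default layout; ``object_ids``
is a hypothetical helper, not part of the tool):

```python
# Sketch: enumerate the RADOS object names a file of a given size may
# occupy, assuming the default 4 MiB object size and no custom striping.

OBJECT_SIZE = 4 * 1024 * 1024

def object_ids(inode, size):
    """Return the object names "<ino hex>.<index hex>" backing the file."""
    n_objects = max(1, -(-size // OBJECT_SIZE))  # ceiling division; >= 1
    return ["%x.%08x" % (inode, i) for i in range(n_objects)]

ids = object_ids(0x10000000000, 10 * 1024 * 1024)  # a 10 MiB file
# -> three candidate objects: 10000000000.00000000 .. 10000000000.00000002
```

Any of those objects may fall in a lost PG, so every candidate ID within the
file's size has to be considered.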
This command does not repair any metadata, so when restoring files in
this case you must *remove* the damaged file and replace it in order
to have a fresh inode. Do not overwrite damaged files in place.
If you know that objects have been lost from PGs, use the ``pg_files``
subcommand to scan for files that may have been damaged as a result:

::

    cephfs-data-scan pg_files <path> <pg id> [<pg id>...]
For example, if you have lost data from PGs 1.4 and 4.5, and you would like
to know which files under /home/bob might have been damaged:

::

    cephfs-data-scan pg_files /home/bob 1.4 4.5
The output will be a list of paths to potentially damaged files, one
per line.
Note that this command acts as a normal CephFS client to find all the
files in the filesystem and read their layouts, so the MDS must be
up and running.
Using an alternate metadata pool for recovery
---------------------------------------------
There has not been extensive testing of this procedure. It should be
undertaken with great care.
If an existing filesystem is damaged and inoperative, it is possible to create
a fresh metadata pool and attempt to reconstruct the filesystem metadata
into this new pool, leaving the old metadata in place. This could be used to
make a safer attempt at recovery since the existing metadata pool would not be
modified.
During this process, multiple metadata pools will contain data referring to
the same data pool. Extreme caution must be exercised to avoid changing the
data pool contents while this is the case. Once recovery is complete, the
damaged metadata pool should be deleted.
To begin this process, first create the fresh metadata pool and initialize
it with empty file system data structures:

::

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create recovery <pg-num> replicated <crush-ruleset-name>
    ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
    cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
    ceph fs reset recovery-fs --yes-i-really-mean-it
    cephfs-table-tool recovery-fs:all reset session
    cephfs-table-tool recovery-fs:all reset snap
    cephfs-table-tool recovery-fs:all reset inode
Next, run the recovery toolset using the ``--alternate-pool`` argument to output
results to the alternate pool:

::

    cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original filesystem name> <original data pool name>
    cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original filesystem name> --force-corrupt --force-init <original data pool name>
If the damaged filesystem contains dirty journal data, it may be recovered next
with:

::

    cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
    cephfs-journal-tool --rank=recovery-fs:0 journal reset --force
After recovery, some recovered directories will have incorrect link counts.
Ensure the parameter ``mds_debug_scatterstat`` is set to false (the default) to
prevent the MDS from checking the link counts, then run a forward scrub to
repair them. Ensure you have an MDS running and issue:

::

    ceph daemon mds.a scrub_path / recursive repair